Abstract

The rapid increase in the performance of graphics hardware, coupled with recent improvements in its programmability,
has made graphics hardware a compelling platform for computationally demanding tasks in a wide variety
of application domains. In this report, we describe, summarize, and analyze the latest research in mapping
general-purpose computation to graphics hardware. 

We begin with the technical motivations that underlie general-purpose computation on graphics processors 
(GPGPU) and describe the hardware and software developments that have led to the recent interest in this field. 
We then aim the main body of this report at two separate audiences. First, we describe the techniques used in 
mapping general-purpose computation to graphics hardware. We believe these techniques will be generally useful 
for researchers who plan to develop the next generation of GPGPU algorithms and techniques. Second, we survey 
and categorize the latest developments in general-purpose application development on graphics hardware. This
survey should be of particular interest to researchers who are interested in using the latest GPGPU applications 
in their systems of interest. 

Categories and Subject Descriptors (according to ACM CCS): I.3.1 [Computer Graphics]: Hardware Architecture;
D.2.2 [Software Engineering]: Design Tools and Techniques 



1. Introduction: Why GPGPU? 

Commodity computer graphics chips are probably today's 
most powerful computational hardware for the dollar. These 
chips, known generically as Graphics Processing Units or 
GPUs, have gone from afterthought peripherals to modern, 
powerful, and programmable processors in their own right. 
Many researchers and developers have become interested in 
harnessing the power of commodity graphics hardware for 
general-purpose computing. Recent years have seen an ex- 
plosion in interest in such research efforts, known collec- 
tively as GPGPU (for "General Purpose GPU") computing. 
In this State of the Art Report we summarize the motiva- 
tion and essential developments in the hardware and soft- 
ware behind GPGPU. We give an overview of the techniques 
and computational building blocks used to map general- 
purpose computation to graphics hardware and provide a 
survey of the various general-purpose computing applica- 
tions to which GPUs have been applied. 

We begin by reviewing the motivation for and challenges 
of general purpose GPU computing. Why GPGPU? 

1.1. Powerful and Inexpensive

Recent graphics architectures provide tremendous memory 
bandwidth and computational horsepower. For example, the 
NVIDIA GeForce 6800 Ultra ($417 as of June 2005) can 
achieve a sustained 35.2 GB/sec of memory bandwidth; the 
ATI X800 XT ($447) can sustain over 63 GFLOPS (compare 
to 14.8 GFLOPS theoretical peak for a 3.7 GHz Intel Pen- 
tium4 SSE unit [Buc04]). GPUs are also on the cutting edge 
of processor technology; for example, the most recently announced
GPU at this writing contains over 300 million transistors
and is built on a 110-nanometer fabrication process.

Not only is current graphics hardware fast, it is acceler- 
ating quickly. For example, the measured throughput of the 
GeForce 6800 is more than double that of the GeForce 5900, 
NVIDIA's previous flagship architecture. In general, the 
computational capabilities of GPUs, measured by the tradi- 
tional metrics of graphics performance, have compounded at 
an average yearly rate of 1.7x (pixels/second) to 2.3x (vertices/second).
This rate of growth outpaces the oft-quoted
Moore's Law as applied to traditional microprocessors; compare
to a yearly rate of roughly 1.4x for CPU performance
[EWN05]. Put another way, graphics hardware performance
is roughly doubling every six months (Figure 1).

Figure 1: The programmable floating-point performance
of GPUs (measured on the multiply-add instruction as 2
floating-point operations per MAD) has increased dramatically
over the last four years when compared to CPUs. Figure
courtesy Ian Buck, Stanford University.

Why is the performance of graphics hardware increasing 
more rapidly than that of CPUs? After all, semiconductor ca- 
pability, driven by advances in fabrication technology, is in- 
creasing at the same rate for both platforms. The disparity in 
performance can be attributed to fundamental architectural 
differences: CPUs are optimized for high performance on 
sequential code, so many of their transistors are dedicated to 
supporting non-computational tasks like branch prediction 
and caching. On the other hand, the highly parallel nature of 
graphics computations enables GPUs to use additional tran- 
sistors for computation, achieving higher arithmetic inten- 
sity with the same transistor count. We discuss the architec- 
tural issues of GPU design further in Section 2. 

This computational power is available and inexpensive; 
these chips can be found in off-the-shelf graphics cards built 
for the PC video game market. A typical latest-generation 
card costs $400-500 at release, and its price drops rapidly as new
hardware emerges. 

1.2. Flexible and Programmable 

Modern graphics architectures have become flexible as 
well as powerful. Once fixed-function pipelines capable 
of outputting only 8-bit-per-channel color values, modern 
GPUs include fully programmable processing units that 
support vectorized floating-point operations at full IEEE 
single precision. High level languages have emerged to 
support the new programmability of the vertex and pixel 
pipelines [BFH*04, MGAK03, MTP*04]. Furthermore, ad- 
ditional levels of programmability are emerging with every 
major generation of GPU (roughly every 18 months). Examples
of major architectural changes in the current generation
of GPUs (as of this writing) include vertex texture access, full
branching support in the vertex pipeline, and limited branching
capability in the fragment pipeline. The next generation
is expected to expand on these changes and add "geome- 
try shaders", or programmable primitive assembly, bringing 
flexibility to an entirely new stage in the pipeline. In short, 
the raw speed, increased precision, and rapidly expanding 
programmability of the hardware make it an attractive plat- 
form for general-purpose computation. 



1.3. Limitations and Difficulties 

The GPU is hardly a computational panacea. The arithmetic 
power of the GPU is a result of its highly specialized archi- 
tecture, evolved over the years to extract the maximum per- 
formance on the highly parallel tasks of traditional computer 
graphics. The rapidly increasing flexibility of the graphics
pipeline, coupled with some ingenious uses of that flexibility
by GPGPU developers, has enabled a great many applications
outside the narrow tasks for which GPUs
were originally designed, but many applications still exist for
which GPUs are not (and likely never will be) well suited.
Word processing, for example, is a classic
"pointer chasing" application, which is dominated by memory
communication and difficult to parallelize.

Today's GPUs also lack some fundamental computing 
constructs, such as integer data operands. The lack of inte- 
gers and associated operations such as bit-shifts and bitwise 
logical operations (AND, OR, XOR, NOT) makes GPUs ill- 
suited for many computationally intense tasks such as cryp- 
tography. Finally, while the recent increase in precision to 
32-bit floating point has enabled a host of GPGPU applica- 
tions, 64-bit double precision arithmetic appears to be on the 
distant horizon at best. The lack of double precision hampers 
or prevents GPUs from being applicable to many very large- 
scale computational science problems. 

GPGPU computing presents challenges even for problems 
that map well to the GPU, because despite advances in pro- 
grammability and high-level languages, graphics hardware 
remains difficult to apply to non-graphics tasks. The GPU 
uses an unusual programming model (Section 2.3), so effec- 
tive GPU programming is not simply a matter of learning a 
new language, or writing a new compiler backend. Instead, 
the computation must be recast into graphics terms by a pro- 
grammer familiar with the underlying hardware, its design, 
limitations, and evolution. We emphasize that these difficul- 
ties are intrinsic to the nature of computer graphics hard- 
ware, not simply a result of immature technology. Computa- 
tional scientists cannot simply wait a generation or two for a 
graphics card with double precision and a FORTRAN com- 
piler. Today, harnessing the power of a GPU for scientific 
or general-purpose computation often requires a concerted 
effort by experts in both computer graphics and in the particular
scientific or engineering domain. But despite the programming
challenges, the potential benefits (a leap forward
in computing capability, and a growth curve much faster than
that of traditional CPUs) are too large to ignore.

1.4. GPGPU Today 

An active, vibrant community of GPGPU developers has 
emerged (see http://GPGPU.org/), and many promising
early applications of GPGPU have appeared already in
the literature. We give an overview of GPGPU applications, 
which range from numeric computing operations such as 
dense and sparse matrix multiplication techniques [KW03] 
or multigrid and conjugate-gradient solvers for systems of 
partial differential equations [BFGS03, GWL*03], to com- 
puter graphics processes such as ray tracing [PBMH02] 
and photon mapping [PDC*03] usually performed offline 
on the CPU, to physical simulations such as fluid mechan- 
ics solvers [BFGS03, Har04, KW03], to database and data 
mining operations [GLW*04, GRM05]. We cover these and 
more applications in Section 5. 

2. Overview of Programmable Graphics Hardware 

The emergence of general-purpose applications on graphics 
hardware has been driven by the rapid improvements in the 
programmability and performance of the underlying graph- 
ics hardware. In this section we will outline the evolution of 
the GPU and describe its current hardware and software. 

2.1. Overview of the Graphics Pipeline 

The application domain of interactive 3D graphics has sev- 
eral characteristics that differentiate it from more general 
computation domains. In particular, interactive 3D graph- 
ics applications require high computation rates and exhibit 
substantial parallelism. Building custom hardware that takes 
advantage of the native parallelism in the application, then, 
allows higher performance on graphics applications than can 
be obtained on more traditional microprocessors. 

All of today's commodity GPUs structure their graphics 
computation in a similar organization called the graphics 
pipeline. This pipeline is designed to allow hardware imple- 
mentations to maintain high computation rates through par- 
allel execution. The pipeline is divided into several stages; 
all geometric primitives pass through every stage. In hard- 
ware, each stage is implemented as a separate piece of hard- 
ware on the GPU in what is termed a task-parallel machine 
organization. Figure 2 shows the pipeline stages in current 
GPUs. 

Figure 2: The modern graphics hardware pipeline. The vertex
and fragment processor stages are both programmable
by the user.

The input to the pipeline is a list of geometry, expressed
as vertices in object coordinates; the output is an image in
a framebuffer. The first stage of the pipeline, the geometry
stage, transforms each vertex from object space into screen
space, assembles the vertices into triangles, and traditionally
performs lighting calculations on each vertex. The output
of the geometry stage is triangles in screen space. The
next stage, rasterization, both determines the screen positions
covered by each triangle and interpolates per-vertex
parameters across the triangle. The result of the rasteriza- 
tion stage is a fragment for each pixel location covered by 
a triangle. The third stage, the fragment stage, computes the 
color for each fragment, using the interpolated values from 
the geometry stage. This computation can use values from 
global memory in the form of textures; typically the frag- 
ment stage generates addresses into texture memory, fetches 
their associated texture values, and uses them to compute the 
fragment color. In the final stage, composition, fragments are 
assembled into an image of pixels, usually by choosing the 
closest fragment to the camera at each pixel location. This 
pipeline is described in more detail in the OpenGL Program- 
ming Guide [OSW*03]. 

2.2. Programmable Hardware 

As graphics hardware has become more powerful, one of the 
primary goals of each new generation of GPU has been to 
increase the visual realism of rendered images. The graph- 
ics pipeline described above was historically a fixed-function 
pipeline, where the limited number of operations available at 
each stage of the graphics pipeline were hardwired for spe- 
cific tasks. However, the success of offline rendering systems 
such as Pixar's RenderMan [Ups90] demonstrated the ben- 
efit of more flexible operations, particularly in the areas of 
lighting and shading. Instead of limiting lighting and shad- 
ing operations to a few fixed functions, RenderMan evalu- 
ated a user-defined shader program on each primitive, with 
impressive visual results. 

Over the past six years, graphics vendors have trans- 
formed the fixed-function pipeline into a more flexible pro- 
grammable pipeline. This effort has been primarily con- 
centrated on two stages of the graphics pipeline: the ge- 
ometry stage and the fragment stage. In the fixed-function 
pipeline, the geometry stage included operations on vertices 
such as transformations and lighting calculations. In the pro- 
grammable pipeline, these fixed-function operations are re- 
placed with a user-defined vertex program. Similarly, the 
fixed-function operations on fragments that determine the
fragment's color are replaced with a user-defined fragment 
program. 

Each new generation of GPUs has increased the function- 
ality and generality of these two programmable stages. 1999 
marked the introduction of the first programmable stage, 
NVIDIA's register combiner operations that allowed a lim- 
ited combination of texture and interpolated color values to 
compute a fragment color. In 2002, ATI's Radeon 9700 led 
the transition to floating-point computation in the fragment 
pipeline. 

The vital step for enabling general-purpose computation 
on GPUs was the introduction of fully programmable hard- 
ware and an assembly language for specifying programs 
to run on each vertex [LKM01] or fragment. This pro- 
grammable shader hardware is explicitly designed to pro- 
cess multiple data-parallel primitives at the same time. As of 
2005, the vertex shader and pixel shader standards are both 
in their third revision, and the OpenGL Architecture Review 
Board maintains extensions for both [Ope04, Ope03]. The 
instruction sets of each stage are limited compared to CPU 
instruction sets; they are primarily math operations, many of 
which are graphics-specific. The newest addition to the in- 
struction sets of these stages has been limited control flow 
operations. 

In general, these programmable stages input a limited 
number of 32-bit floating-point 4-vectors. The vertex stage 
outputs a limited number of 32-bit floating-point 4-vectors 
that will be interpolated by the rasterizer; the fragment 
stage outputs up to 4 floating-point 4-vectors, typically col- 
ors. Each programmable stage can access constant registers 
across all primitives and also read-write registers per primitive.
The programmable stages have limits on their numbers
of inputs, outputs, constants, registers, and instructions; with 
each new revision of the vertex shader and pixel [fragment] 
shader standard, these limits have increased. 

GPUs typically have multiple vertex and fragment pro- 
cessors (for example, the NVIDIA GeForce 6800 Ultra and 
ATI Radeon X800 XT each have 6 vertex and 16 fragment 
processors). Fragment processors have the ability to fetch 
data from textures, so they are capable of memory gather. 
However, the output address of a fragment is always deter- 
mined before the fragment is processed — the processor can- 
not change the output location of a pixel — so fragment pro- 
cessors are incapable of memory scatter. Vertex processors 
recently acquired texture capabilities, and they are capable 
of changing the position of input vertices, which ultimately 
affects where in the image pixels will be drawn. Thus, vertex 
processors are capable of both gather and scatter. Unfortu- 
nately, vertex scatter can lead to memory and rasterization 
coherence issues further down the pipeline. Combined with 
the lower performance of vertex processors, this limits the 
utility of vertex scatter in current GPUs. 
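
The distinction between gather and scatter can be illustrated on the CPU. The following Python sketch is only an analogy, not GPU code; the function names `gather` and `scatter` and the sample data are our own assumptions. In a gather, the write location is fixed and the read address is computed; in a scatter, the write address itself is computed.

```python
def gather(src, indices):
    """Gather: output element i reads from the computed address indices[i].

    The write location (i) is fixed in advance, as it is for a fragment.
    """
    return [src[j] for j in indices]


def scatter(src, indices, size, fill=0.0):
    """Scatter: input element i writes to the computed address indices[i].

    If two inputs target the same address, the later write wins here; on
    parallel hardware the outcome would depend on execution order.
    """
    out = [fill] * size
    for i, j in enumerate(indices):
        out[j] = src[i]
    return out


data = [10.0, 20.0, 30.0, 40.0]
perm = [2, 0, 3, 1]
gathered = gather(data, perm)        # [30.0, 10.0, 40.0, 20.0]
scattered = scatter(data, perm, 4)   # [20.0, 40.0, 10.0, 30.0]
```

When the scatter addresses form a permutation that is known in advance, the same result can be obtained with a gather through the inverse index map, which is how a gather-only processor can sometimes stand in for a scatter.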



2.3. Introduction to the GPU Programming Model 

As we discussed in Section 1, GPUs are a compelling so- 
lution for applications that require high arithmetic rates 
and data bandwidths. GPUs achieve this high performance 
through data parallelism, which requires a programming 
model distinct from the traditional CPU sequential program- 
ming model. In this section, we briefly introduce the GPU 
programming model using both graphics API terminology 
and the terminology of the more abstract stream program- 
ming model, because both are common in the literature. 

The stream programming model exposes the parallelism 
and communication patterns inherent in the application 
by structuring data into streams and expressing compu- 
tation as arithmetic kernels that operate on streams. Pur- 
cell et al. [PBMH02] characterize their ray tracer in the 
stream programming model; Owens [Owe05] and Lefohn et 
al. [LKO05] discuss the stream programming model in the 
context of graphics hardware, and the Brook programming 
system [BFH*04] offers a stream programming system for 
GPUs. 
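
As a minimal illustration of the stream model, the following Python sketch applies one arithmetic kernel independently to every element of two input streams. It is a CPU analogy under our own naming (`saxpy_kernel`, `run_kernel`), not Brook code and not the API of any system cited above.

```python
def saxpy_kernel(a, x, y):
    """Kernel: computes one output element from one element of each input stream."""
    return a * x + y


def run_kernel(kernel, a, xs, ys):
    """Apply the kernel independently to every stream element.

    No invocation reads another invocation's output, so the iterations
    could run in any order, or all at once on parallel hardware.
    """
    if len(xs) != len(ys):
        raise ValueError("input streams must have the same length")
    return [kernel(a, x, y) for x, y in zip(xs, ys)]


result = run_kernel(saxpy_kernel, 2.0, [1.0, 2.0, 3.0], [10.0, 20.0, 30.0])
# result == [12.0, 24.0, 36.0]
```

The independence of the kernel invocations is the property that exposes parallelism: the runtime, not the programmer, decides how the elements are distributed across processors.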

Because typical scenes have more fragments than ver- 
tices, in modern GPUs the programmable stage with the 
highest arithmetic rates is the fragment processor. A typical 
GPGPU program uses the fragment processor as the compu- 
tation engine in the GPU. Such a program is structured as 
follows [Har05a]: 

1. First, the programmer determines the data-parallel portions
of the application. The application must be segmented
into independent parallel sections. Each of these
sections can be considered a kernel and is implemented
as a fragment program. The inputs and outputs of each kernel
program are one or more data arrays, which are stored
(sometimes only transiently) in textures in GPU memory.
In stream processing terms, the data in the textures com- 
prise streams, and a kernel is invoked in parallel on each 
stream element. 

2. To invoke a kernel, the range of the computation (or the 
size of the output stream) must be specified. The pro- 
grammer does this by passing vertices to the GPU. A 
typical GPGPU invocation is a quadrilateral (quad) ori- 
ented parallel to the image plane, sized to cover a rect- 
angular region of pixels matching the desired size of the 
output array. Note that GPUs excel at processing data in 
two-dimensional arrays, but are limited when processing 
one-dimensional arrays. 

3. The rasterizer generates a fragment for every pixel loca- 
tion in the quad, producing thousands to millions of frag- 
ments. 

4. Each of the generated fragments is then processed by the 
active kernel fragment program. Note that every frag- 
ment is processed by the same fragment program. The 
fragment program can read from arbitrary global mem- 
ory locations (with texture reads) but can only write to 
memory locations corresponding to the location of the 
fragment in the frame buffer (as determined by the rasterizer).
The domain of the computation is specified for
each input texture (stream) by specifying texture coordi- 
nates at each of the input vertices, which are then inter- 
polated at each generated fragment. Texture coordinates 
can be specified independently for each input texture, and 
can also be computed on the fly in the fragment program, 
allowing arbitrary memory addressing. 
5. The output of the fragment program is a value (or vec- 
tor of values) per fragment. This output may be the final 
result of the application, or it may be stored as a texture 
and then used in additional computations. Complex ap- 
plications may require several or even dozens of passes 
("multipass") through the pipeline. 
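
The steps above can be emulated on the CPU. The following Python sketch is a simplified analogy, not graphics API code: the names `fetch`, `rasterize_quad`, `blur_fragment_program`, and `run_pass` are hypothetical, textures are plain nested lists, and texture coordinates are assumed to be integer pixel positions with clamp-to-edge addressing.

```python
def fetch(texture, x, y):
    """Texture read (gather) with clamp-to-edge addressing."""
    h, w = len(texture), len(texture[0])
    x = min(max(x, 0), w - 1)
    y = min(max(y, 0), h - 1)
    return texture[y][x]


def rasterize_quad(width, height):
    """Step 3: generate one fragment (pixel location) per covered pixel."""
    for y in range(height):
        for x in range(width):
            yield x, y


def blur_fragment_program(texture, x, y):
    """Step 4: the kernel. It may read any texel but returns a single value."""
    total = (fetch(texture, x, y) + fetch(texture, x - 1, y) +
             fetch(texture, x + 1, y) + fetch(texture, x, y - 1) +
             fetch(texture, x, y + 1))
    return total / 5.0


def run_pass(fragment_program, texture):
    """Steps 2-5: cover the output with a quad and shade every fragment."""
    height, width = len(texture), len(texture[0])
    output = [[0.0] * width for _ in range(height)]
    for x, y in rasterize_quad(width, height):
        # The write address is fixed by the rasterizer, not by the program.
        output[y][x] = fragment_program(texture, x, y)
    return output


source = [[0.0, 0.0, 0.0],
          [0.0, 5.0, 0.0],
          [0.0, 0.0, 0.0]]
blurred = run_pass(blur_fragment_program, source)
```

Note that `run_pass` writes into a separate output array: the kernel reads from the input texture and writes only to its own fragment location, which mirrors the gather-only restriction described in step 4.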

While the complexity of a single pass through the pipeline 
may be limited (for example, by the number of instructions, 
by the number of outputs allowed per pass, or by the limited 
control complexity allowed in a single pass), using multiple 
passes allows the implementation of programs of arbitrary 
complexity. For example, Peercy et al. [POAU00] demon- 
strated that even the fixed-function pipeline, given enough 
passes, can implement arbitrary RenderMan shaders. 
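
A common way to chain passes is to let the output of one pass serve as the input texture of the next, a pattern often called ping-ponging. The following Python sketch illustrates the idea on a one-dimensional array; the names `relax_pass` and `run_multipass` and the averaging kernel are assumptions chosen for illustration.

```python
def relax_pass(src):
    """One pass: each interior element becomes the mean of its two neighbours."""
    n = len(src)
    return [src[i] if i in (0, n - 1) else 0.5 * (src[i - 1] + src[i + 1])
            for i in range(n)]


def run_multipass(initial, passes):
    """Chain passes: the output 'texture' of one pass is the input of the next."""
    read_buf = list(initial)
    for _ in range(passes):
        write_buf = relax_pass(read_buf)   # render into the write buffer
        read_buf = write_buf               # swap roles for the next pass
    return read_buf


after_two = run_multipass([0.0, 0.0, 0.0, 0.0, 4.0], 2)
# after_two == [0.0, 0.0, 1.0, 2.0, 4.0]
```

Each pass stays simple, but information propagates one element further per pass, so the number of passes, rather than the complexity of any single pass, determines how far a result can depend on distant inputs.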

2.4. GPU Program Flow Control 

Flow control is a fundamental concept in computation. 
Branching and looping are such basic concepts that it can 
be daunting to write software for a platform that supports 
them to only a limited extent. The latest GPUs support vertex 
and fragment program branching in multiple forms, but their 
highly parallel nature requires care in how they are used. 
This section surveys some of the limitations of branching on 
current GPUs and describes a variety of techniques for iter- 
ation and decision-making in GPGPU programs. For more 
detail on GPU flow control, see Harris and Buck [HB05]. 

2.4.1. Hardware Mechanisms for Flow Control 

There are three basic implementations of data-parallel 
branching in use on current GPUs: predication, MIMD 
branching, and SIMD branching. 

Architectures that support only predication do not have 
true data-dependent branch instructions. Instead, the GPU 
evaluates both sides of the branch and then discards one of 
the results based on the value of the Boolean branch condi- 
tion. The disadvantage of predication is that evaluating both 
sides of the branch can be costly, but not all current GPUs 
have true data-dependent branching support. The compiler 
for high-level shading languages like Cg or the OpenGL 
Shading Language automatically generates predicated as- 
sembly language instructions if the target GPU supports only 
predication for flow control. 
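
The effect of predication can be sketched in Python as follows. This is an illustration of the idea only, not the output of any shading-language compiler; the function names and the example computation are our own assumptions. Both sides of the branch are always evaluated, and the condition merely selects which result survives.

```python
def branchy(x):
    """Reference version with a true data-dependent branch."""
    if x > 0.0:
        return x * 2.0
    return -x


def predicated(x):
    """Predicated version: evaluate both sides, then select one result."""
    cond = float(x > 0.0)        # 1.0 or 0.0, stands in for a compare instruction
    then_val = x * 2.0           # always computed
    else_val = -x                # always computed
    return cond * then_val + (1.0 - cond) * else_val


samples = [-3.0, -0.5, 0.0, 0.5, 3.0]
same = all(predicated(x) == branchy(x) for x in samples)
```

The predicated version performs the work of both sides for every input, which is why predication becomes costly when either side of the branch is expensive.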

© The Eurographics Association 2005.

In Multiple Instruction Multiple Data (MIMD) architectures
that support branching, different processors can follow
different paths through the program. In Single Instruction
Multiple Data (SIMD) architectures, all active processors
must execute the same instructions at the same time. The 
only MIMD processors in a current GPU are the vertex pro- 
cessors of the NVIDIA GeForce 6 and NV40 Quadro GPUs. 
All current GPU fragment processors are SIMD. In SIMD, 
when evaluation of the branch condition is identical on all 
active processors, only the taken side of the branch must be 
evaluated, but if one or more of the processors evaluates the 
branch condition differently, then both sides must be evalu- 
ated and the results predicated. As a result, divergence in the 
branching of simultaneously processed fragments can lead 
to reduced performance. 
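
The cost of divergence can be made concrete with a simple model. The following Python sketch is an assumption-laden illustration: the group size, the per-side instruction costs, and the function names are hypothetical, and real hardware groups fragments in ways that vary by architecture.

```python
def simd_group_cost(conditions, cost_then, cost_else):
    """Instruction cost for one group of fragments processed in lockstep."""
    if all(conditions):
        return cost_then                 # coherent: only the taken side runs
    if not any(conditions):
        return cost_else                 # coherent: only the other side runs
    return cost_then + cost_else         # divergent: both sides, predicated


def total_cost(conditions, group_size, cost_then, cost_else):
    """Sum the cost over consecutive groups of `group_size` fragments."""
    return sum(
        simd_group_cost(conditions[i:i + group_size], cost_then, cost_else)
        for i in range(0, len(conditions), group_size)
    )


coherent = [True] * 4 + [False] * 4
divergent = [True, False] * 4
cost_coherent = total_cost(coherent, 4, 10, 2)    # 10 + 2 = 12
cost_divergent = total_cost(divergent, 4, 10, 2)  # 12 + 12 = 24
```

Both inputs take the branch equally often; only the spatial arrangement differs. Under this model the interleaved pattern costs twice as much, which is why branch coherence among neighboring fragments matters.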

2.4.2. Moving Branching Up The Pipeline 

Because explicit branching can hamper performance on 
GPUs, it is useful to have multiple techniques to reduce the 
cost of branching. A useful strategy is to move flow-control 
decisions up the pipeline to an earlier stage where they can 
be more efficiently evaluated. 

2.4.2.1. Static Branch Resolution On the GPU, as on the 
CPU, avoiding branching inside inner loops is beneficial. 
For example, when evaluating a partial differential equa- 
tion (PDE) on a discrete spatial grid, an efficient implemen- 
tation divides the processing into multiple loops: one over 
the interior of the grid, excluding boundary cells, and one 
or more over the boundary edges. This static branch res- 
olution results in loops that contain efficient code without 
branches. (In stream processing terminology, this technique 
is typically referred to as the division of a stream into substreams.)
On the GPU, the computation is divided into two
fragment programs: one for interior cells and one for bound- 
ary cells. The interior program is applied to the fragments 
of a quad drawn over all but the outer one-pixel edge of the 
output buffer. The boundary program is applied to fragments 
of lines drawn over the edge pixels. Static branch resolution 
is further discussed by Goodnight et al. [GWL*03], Harris
and James [HJ03], and Lefohn et al. [LKHW03].
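
The following Python sketch illustrates static branch resolution on the CPU for a small grid. It is a minimal example under our own assumptions (a four-neighbor averaging stencil and a zero-valued boundary rule); the two functions compute the same result, but the second contains no branch inside its loops.

```python
def step_with_branch(grid):
    """Single loop over all cells, testing for the boundary inside the loop."""
    h, w = len(grid), len(grid[0])
    out = [row[:] for row in grid]
    for y in range(h):
        for x in range(w):
            if x == 0 or y == 0 or x == w - 1 or y == h - 1:
                out[y][x] = 0.0                      # boundary rule
            else:
                out[y][x] = 0.25 * (grid[y][x - 1] + grid[y][x + 1] +
                                    grid[y - 1][x] + grid[y + 1][x])
    return out


def step_split(grid):
    """Same result, with the branch resolved statically into separate loops."""
    h, w = len(grid), len(grid[0])
    out = [row[:] for row in grid]
    # "Interior program": corresponds to a quad covering all but the outer edge.
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            out[y][x] = 0.25 * (grid[y][x - 1] + grid[y][x + 1] +
                                grid[y - 1][x] + grid[y + 1][x])
    # "Boundary program": corresponds to lines drawn over the edge pixels.
    for x in range(w):
        out[0][x] = out[h - 1][x] = 0.0
    for y in range(h):
        out[y][0] = out[y][w - 1] = 0.0
    return out


grid = [[float(x + 4 * y) for x in range(4)] for y in range(4)]
```

On the GPU the two loop nests become two draw calls with different fragment programs, so the decision about which code path a cell takes is made by the geometry that is drawn rather than by a per-fragment test.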

2.4.2.2. Pre-computation In the example above, the re- 
sult of a branch was constant over a large domain of input 
(or range of output) values. Similarly, sometimes the result 
of a branch is constant for a period of time or a number 
of iterations of a computation. In this case we can evalu- 
ate the branches only when the results are known to change, 
and store the results for use over many subsequent itera- 
tions. This can result in a large performance boost. This 
technique is used to pre-compute an obstacle offset array in 
the Navier-Stokes fluid simulation example in the NVIDIA 
SDK [Har05b]. 
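
The idea can be sketched as follows. This Python example is a hypothetical one-dimensional simplification and is not the code of the NVIDIA SDK example; the names `right_offsets`, `step_branching`, and `step_precomputed` and the averaging rule are assumptions. The obstacle test is evaluated once and stored as an offset per cell, and every subsequent iteration reuses that array instead of branching.

```python
def right_offsets(obstacles):
    """Run the obstacle test once and store its result as an offset per cell."""
    n = len(obstacles)
    return [1 if i + 1 < n and not obstacles[i + 1] else 0 for i in range(n)]


def step_branching(values, obstacles):
    """Reference step: re-evaluates the obstacle test for every cell, every call."""
    n = len(values)
    out = []
    for i in range(n):
        if i + 1 < n and not obstacles[i + 1]:
            neighbour = values[i + 1]
        else:
            neighbour = values[i]        # blocked: fall back to the cell itself
        out.append(0.5 * (values[i] + neighbour))
    return out


def step_precomputed(values, offsets):
    """Branch-free step: the stored offset selects the neighbour to read."""
    return [0.5 * (values[i] + values[i + offsets[i]])
            for i in range(len(values))]


obstacles = [False, False, True, False]
offsets = right_offsets(obstacles)       # recomputed only when obstacles change
state = [1.0, 2.0, 3.0, 4.0]
for _ in range(3):
    state = step_precomputed(state, offsets)
```

The offsets need to be recomputed only when the obstacles move, so the cost of the branch is amortized over all iterations in between.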

2.4.2.3. Z-Cull Precomputed branch results can be taken 
a step further by using another GPU feature to entirely skip 
unnecessary work. Modern GPUs have a number of features 
designed to avoid shading pixels that will not be seen. One 
of these is Z-cull. Z-cull is a hierarchical technique for comparing
the depth (Z) of an incoming block of fragments with
the depth of the corresponding block of fragments in the Z- 
buffer. If the incoming fragments will all fail the depth test, 
then they are discarded before their pixel colors are calcu- 
lated in the fragment processor. Thus, only fragments that 
pass the depth test are processed, work is saved, and the ap- 
plication runs faster. In fluid simulation, "land-locked" ob- 
stacle cells can be "masked" with a z-value of zero so that 
all fluid simulation computations will be skipped for those 
cells. If the obstacles are fairly large, then a lot of work is 
saved by not processing these cells. Sander et al. described 
this technique [STM04] together with another Z-cull accel- 
eration technique for fluid simulation, and Harris and Buck 
provide pseudocode [HB05]. Z-cull was also used by Purcell 
et al. to accelerate GPU ray tracing [PBMH02]. 
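
The following Python sketch models the block-level behavior described above. It is a rough CPU analogy under stated assumptions: the block size is arbitrary, a Boolean mask stands in for the depth buffer, and fragments in partially masked blocks are assumed to be shaded before the depth test discards their writes. Actual hardware behavior varies and is not specified here.

```python
def run_with_zcull(values, masked, block, kernel):
    """Apply `kernel` to unmasked cells, skipping fully masked blocks.

    Returns (output, kernel_invocations). Masked cells keep their old value.
    """
    out = list(values)
    invocations = 0
    for start in range(0, len(values), block):
        cells = range(start, min(start + block, len(values)))
        if all(masked[i] for i in cells):
            continue                      # coarse test: whole block rejected
        for i in cells:
            result = kernel(values[i])    # assumed: shaded before the depth test
            invocations += 1
            if not masked[i]:             # depth test decides if the write lands
                out[i] = result
    return out, invocations


values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
masked = [True, True, True, True, False, True, False, False]
out, invocations = run_with_zcull(values, masked, 4, lambda v: v * 2.0)
# out == [1.0, 2.0, 3.0, 4.0, 10.0, 6.0, 14.0, 16.0], invocations == 4
```

In this model the isolated masked cell in the second block saves no work, whereas the fully masked first block is skipped entirely, which is consistent with the observation that the savings are largest when the masked regions are fairly large.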

2.4.2.4. Data-Dependent Looping With Occlusion 
Queries Another GPU feature designed to avoid drawing 
what is not visible is the hardware occlusion query (OQ). 
This feature provides the ability to query the number 
of pixels updated by a rendering call. These queries are 
pipelined, which means that they provide a way to get a 
limited amount of data (an integer count) back from the 
GPU without stalling the pipeline (which would occur when 
actual pixels are read back). Because GPGPU applications 
almost always draw quads with known pixel coverage, 
OQ can be used with fragment kill functionality to get 
a count of fragments updated and killed. This allows the 
implementation of global decisions controlled by the CPU 
based on GPU processing. Purcell et al. demonstrated this 
in their GPU ray tracer [PBMH02], and Harris and Buck 
provide pseudocode for the technique [HB05]. Occlusion 
queries can also be used for subdivision algorithms, such as 
the adaptive radiosity solution of Coombe et al. [CHL04].
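
The control pattern can be sketched in Python as follows. No graphics API calls are used: the count returned by the hypothetical function `relax_and_count` stands in for the occlusion query result, and skipping an element whose change falls within a tolerance stands in for a fragment kill. The relaxation kernel and all names are assumptions chosen for illustration.

```python
def relax_and_count(values, tolerance):
    """One 'rendering pass'. Returns (new_values, updated_count).

    Elements that would change by no more than `tolerance` are left
    untouched (the analogue of a killed fragment); `updated_count` is the
    analogue of the occlusion-query result.
    """
    out = list(values)
    updated = 0
    for i in range(1, len(values) - 1):
        target = 0.5 * (values[i - 1] + values[i + 1])
        if abs(target - values[i]) > tolerance:
            out[i] = target
            updated += 1
    return out, updated


def iterate_until_converged(values, tolerance, max_passes=10000):
    """CPU-side loop: keep issuing passes until a pass updates no elements."""
    passes = 0
    while passes < max_passes:
        values, updated = relax_and_count(values, tolerance)
        passes += 1
        if updated == 0:
            break
    return values, passes


final, passes = iterate_until_converged([0.0, 0.0, 0.0, 4.0], 1e-3)
```

Only a single integer crosses from the data-parallel pass to the controlling loop on each iteration, which is the property that makes this pattern inexpensive compared with reading back the full array to test for convergence.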



3. Programming Systems 

Successful programming for any development platform re- 
quires at least three basic components: a high-level language 
for code development, a debugging environment, and profil- 
ing tools. CPU programmers have a large number of well- 
established languages, debuggers, and profilers to choose 
from when writing applications. Conversely, GPU program- 
mers have just a small handful of languages to choose from, 
and few if any full-featured debuggers and profilers. 

In this section we look at the high-level languages that 
have been developed for GPU programming, and the debug- 
ging tools that are available for GPU programmers. Code 
profiling and tuning tends to be a very architecture-specific 
task. GPU architectures have evolved very rapidly, making 
profiling and tuning primarily the domain of the GPU man- 
ufacturer. As such, we will not discuss code profiling tools 
in this section. 



3.1. High-level Shading Languages 

Most high-level GPU programming languages today share 
one thing in common: they are designed around the idea that 
GPUs generate pictures. As such, the high-level program- 
ming languages are often referred to as shading languages. 
That is, they are high-level languages that compile into a
vertex shader and a fragment shader to produce the image 
described by the program. 

Cg [MGAK03], HLSL [Mic05a], and the OpenGL Shad- 
ing Language [KBR04] all abstract the capabilities of the 
underlying GPU and allow the programmer to write GPU 
programs in a more familiar C-like programming language. 
They do not stray far from their origins as languages de- 
signed to shade polygons. All retain graphics-specific con- 
structs: vertices, fragments, textures, etc. Cg and HLSL pro- 
vide abstractions that are very close to the hardware, with 
instruction sets that expand as the underlying hardware ca- 
pabilities expand. The OpenGL Shading Language was de- 
signed looking a bit further out, with many language features 
(e.g. integers) that do not directly map to hardware available 
today. 

Sh is a shading language implemented on top of 
C++ [MTP*04]. Sh provides a shader algebra for manipu- 
lating and defining procedurally parameterized shaders. Sh 
manages buffers and textures, and handles shader partition- 
ing into multiple passes. 

Finally, Ashli [BP03] works at a level one step above 
that of Cg, HLSL, or the OpenGL Shading Language. Ashli 
reads as input shaders written in HLSL, the OpenGL Shad- 
ing Language, or a subset of RenderMan. Ashli then auto- 
matically compiles and partitions the input shaders to run on 
a programmable GPU. 

3.2. GPGPU Languages and Libraries 

More often than not, the graphics-centric nature of shading 
languages makes GPGPU programming more difficult than 
it needs to be. As a simple example, initiating a GPGPU 
computation usually involves drawing a primitive. Looking 
up data from memory is done by issuing a texture fetch. The 
GPGPU program may conceptually have nothing to do with 
drawing geometric primitives and fetching textures, yet the 
shading languages described in the previous section force 
the GPGPU application writer to think in terms of geomet- 
ric primitives, fragments, and textures. Instead, GPGPU al- 
gorithms are often best described as memory and math op- 
erations, concepts much more familiar to CPU program- 
mers. The programming systems below attempt to provide 
GPGPU functionality while hiding the GPU-specific details 
from the programmer. 

The Brook programming language extends ANSI C with 
concepts from stream programming [BFH*04]. Brook can 
use the GPU as a compilation target. Brook streams are conceptually
similar to arrays, except all elements can be operated
on in parallel. Kernels are the functions that operate on
streams. Brook automatically maps kernels and streams into
fragment programs and texture memory.

Figure 3: Examples of fragment program "printf" debugging.
The left image encodes ray-object intersection hit
points as r, g, b color. The right image draws a point at each
location where a photon was stored in a photon map. (Images
generated by Purcell et al. [PDC*03].)

Scout is a GPU programming language designed for sci- 
entific visualization [MIA*04]. Scout allows runtime map- 
ping of mathematical operations over data sets for visualiza- 
tion. 

Finally, Glift is a generic template library for a wide 
range of GPU data structures [LKS*05]. It is designed 
to be a stand-alone GPU data 
structure library that helps simplify data structure design and 
separate GPU algorithms from data structures. The library 
integrates with a C++, Cg, and OpenGL GPU development 
environment. 

3.3. Debugging Tools 

A high-level programming language gives a programmer the 
ability to create complex programs with much less effort 
than writing assembly language. With several high-level 
programming languages available to choose from, generating 
complex programs to run on the GPU is fairly straightfor- 
ward. But all good development platforms require more than 
just a language to write in. One of the most important tools 
needed for successful platforms is a debugger. Until recently, 
support for debugging on GPUs was fairly limited. 

The needs of a debugger for GPGPU programming are 
very similar to what traditional CPU debuggers provide, in- 
cluding variable watches, program break points, and single- 
step execution. GPU programs often involve user interaction. 
While a debugger does not need to run the application at full 
speed, the application being debugged should maintain some 
degree of interactivity. A GPU debugger should be easy to 
add to and remove from an existing application, should man- 
gle GPU state as little as possible, and should execute the de- 
bug code on the GPU, not in a software rasterizer. Finally, a 
GPU debugger should support the major GPU programming 
APIs and vendor-specific extensions. 

A GPU debugger has a challenge in that it must be able 
to provide debug information for multiple vertices or pixels 
at a time. In many cases, graphically displaying the data for 
a given set of pixels gives a much better sense of whether 
a computation is correct than a text box full of numbers 
would. This visualization is essentially a "printf-style" de- 
bug, where the values of interest are printed to the screen. 
Figure 3 shows some examples of printf-style debugging 
that many GPGPU programmers have become adept at im- 
plementing as part of the debugging process. Drawing data 
values to the screen for visualization often requires some 
amount of scaling and biasing for values that don't fit in an 
8-bit color buffer (e.g. when rendering floating point data). 
The ideal GPGPU debugger would automate printf-style de- 
bugging, including programmable scale and bias, while also 
retaining the true data value at each point if it is needed. 

There are a few different systems for debugging GPU pro- 
grams available to use, but nearly all are missing one or more 
of the important features we just discussed. 

gDEBugger [Gra05] and GLIntercept [Tre05] are tools 
designed to help debug OpenGL programs. Both are able to 
capture and log OpenGL state from a program. gDEBugger 
allows a programmer to set breakpoints and watch OpenGL 
state variables at runtime. There is currently no specific sup- 
port for debugging shaders. GLIntercept does provide run- 
time shader editing, but again is lacking in shader debugging 
support. 

The Microsoft Shader Debugger [Mic05b], however, 
does provide runtime variable watches and breakpoints for 
shaders. The shader debugger is integrated into the Visual 
Studio IDE, and provides all the same functionality pro- 
grammers are used to for traditional programming. Unfortu- 
nately, debugging requires the shaders to be run in software 
emulation rather than on the hardware. In contrast, the Apple 
OpenGL Shader Builder [App05b] also has a sophisticated 
IDE and actually runs shaders in real time on the hardware 
during shader debug and edit. The downside to this tool is 
that it was designed for writing shaders, not for computation. 
The shaders are not run in the context of the application, but 
in a separate environment designed to help facilitate shader 
writing. 

While many of the tools mentioned so far provide a lot of 
useful features for debugging, none provide any support for 
shader data visualization or printf-style debugging. Some- 
times this is the single most useful tool for debugging pro- 
grams. The Image Debugger [Bax05] was among the first 
tools to provide this functionality by providing a printf-like 
function over a region of memory. The region of memory 
gets mapped to a display window, allowing a programmer 
to visualize any block of memory as an image. The Image 
Debugger does not provide any special support for shader 
programs, so programmers must write shaders such that the 
output gets mapped to an output buffer for visualization. 

The Shadesmith Fragment Program Debugger [PS03] was 
the first system to automate printf-style debugging while 
providing basic shader debugging functionality like break- 
points and stepping. Shadesmith works by decomposing a 
fragment program into multiple independent shaders, one for 
each assembly instruction in the shader, then adding output 
instructions to each of these smaller programs. The effects 
of executing any instruction can be determined by running 
the right shader. Shadesmith automates the printf debug by 
running the appropriate shader for a register that is being 
watched, and drawing the output to an image window. Track- 
ing multiple registers is done by running multiple programs 
and displaying the results in separate windows. Shadesmith 
also provides the programmer the ability to write programs 
to arbitrarily scale and bias the watched registers. While 
Shadesmith represents a big step in the right direction for 
GPGPU debugging, it still has many limitations, the largest 
of which is that Shadesmith is currently limited to debug- 
ging assembly language shaders. GPGPU programs today 
are generally too complex for assembly level programming. 
Additionally, Shadesmith only works for OpenGL fragment 
programs, and provides no support for debugging OpenGL 
state. 
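
The decomposition idea can be illustrated with a toy sketch. Everything here is hypothetical: the three-operation instruction set, the tuple encoding, and the assumption that each smaller program is a prefix of the original with one output instruction appended. It is not Shadesmith's actual implementation, only a CPU model of how running one truncated program per instruction exposes the value of a watched register after each step.

```python
def run(program, inputs):
    """Interpret a toy register program; uninitialized registers read as 0.0."""
    regs = dict(inputs)
    out = None
    for op, dst, a, b in program:
        if op == "ADD":
            regs[dst] = regs.get(a, 0.0) + regs.get(b, 0.0)
        elif op == "MUL":
            regs[dst] = regs.get(a, 0.0) * regs.get(b, 0.0)
        elif op == "OUT":
            out = regs.get(a, 0.0)  # the appended "output instruction"
        else:
            raise ValueError("unknown opcode: " + op)
    return out

def decompose(program, watch):
    """One smaller program per instruction, each ending by writing out `watch`."""
    return [program[:i + 1] + [("OUT", None, watch, None)]
            for i in range(len(program))]

# Hypothetical three-instruction "shader".
shader = [
    ("MUL", "R0", "a", "b"),    # R0 = a * b
    ("ADD", "R1", "R0", "c"),   # R1 = R0 + c
    ("MUL", "R0", "R1", "R1"),  # R0 = R1 * R1
]
inputs = {"a": 2.0, "b": 3.0, "c": 1.0}

# Value of the watched register R0 after each instruction.
trace = [run(p, inputs) for p in decompose(shader, "R0")]
print(trace)  # [6.0, 6.0, 49.0]
```

Stepping to instruction i then amounts to choosing which of the smaller programs to run.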

Finally, Duca et al. have recently described a system 
that provides debugging not only for graphics state but also 
for both vertex and fragment programs [DNB*05]. Their system 
builds a database of graphics state for which the user writes 
SQL-style queries. Based on the queries, the system extracts 
the necessary graphics state and program data and draws the 
appropriate data into a debugging window. The system is 
built on top of the Chromium [HHN*02] library, enabling 
debugging of any OpenGL applications without modifica- 
tion to the original source program. This promising approach 
combines graphics state debugging and program debugging 
with visualizations in a transparent and hardware-rendered 
approach. 



4. GPGPU Techniques 

This section is targeted at the developer of GPGPU libraries 
and applications. We enumerate the techniques required to 
efficiently map complex applications to the GPU and de- 
scribe some of the building blocks of GPU computation. 



4.1. Stream Operations 

Recall from Section 2.3 that the stream programming 
model is a useful abstraction for programming GPUs. There 
are several fundamental operations on streams that many 
GPGPU applications implement as a part of computing their 
final results: map, reduce, scatter and gather, stream filtering, 
sort, and search. In the following sections we define each of 
these operations, and briefly describe a GPU implementation 
for each. 



4.1.1. Map 

Perhaps the simplest operation, the map (or apply) operation 
operates just like a mapping function in Lisp. Given a stream 
of data elements and a function, map will apply the function 
to every element in the stream. A simple example of the map 
operator is applying scale and bias to a set of input data for 
display in a color buffer. 

The GPU implementation of map is straightforward. 
Since map is also the most fundamental operation to GPGPU 
applications, we will describe its GPU implementation in detail. 
In Section 2.3, we saw how to use the GPU's fragment 
processor as the computation engine for GPGPU. The five 
steps described there are the essence of the map implementation on the GPU. 
First, the programmer writes a function that gets applied to 
every element as a fragment program, and stores the stream 
of data elements in texture memory. The programmer then 
invokes the fragment program by rendering geometry that 
causes the rasterizer to generate a fragment for every pixel 
location in the specified geometry. The fragments are pro- 
cessed by the fragment processors, which apply the program 
to the input elements. The result of the fragment program 
execution is the result of the map operation. 
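
As a CPU-side illustration of the semantics only, the following sketch models map with the scale-and-bias example mentioned above. The names `map_stream` and `make_scale_bias` are ours, and the list comprehension stands in for the rasterizer generating one fragment per stream element; no GPU API is involved.

```python
def map_stream(kernel, stream):
    """Apply `kernel` independently to every element (one 'fragment' each)."""
    return [kernel(x) for x in stream]

def make_scale_bias(scale, bias):
    """Build a kernel that scales and biases a value, then clamps it to [0, 1]."""
    def kernel(x):
        return min(1.0, max(0.0, x * scale + bias))
    return kernel

data = [-2.0, -1.0, 0.0, 1.0, 2.0]
display = map_stream(make_scale_bias(0.25, 0.5), data)
print(display)  # [0.0, 0.25, 0.5, 0.75, 1.0]
```

Because the kernel reads only its own input element, the elements can be processed in any order, which is what lets the fragment processors run them in parallel.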

4.1.2. Reduce 

Sometimes a computation requires computing a smaller 
stream from a larger input stream, possibly to a single ele- 
ment stream. This type of computation is called a reduction. 
For example, a reduction can be used to compute the sum or 
maximum of all the elements in a stream. 

On GPUs, reductions can be performed by alternately ren- 
dering to and reading from a pair of textures. On each render- 
ing pass, the size of the output, the computational range, is 
reduced by one half. In general, we can compute a reduction 
over a set of data in O(log n) steps using the parallel GPU 
hardware, compared to O(n) steps for a sequential reduction 
on the CPU. To produce each element of the output, a 
fragment program reads two values, one from a correspond- 
ing location on either half of the previous pass result buffer, 
and combines them using the reduction operator (for exam- 
ple, addition or maximum). These passes continue until the 
output is a one-by-one buffer, at which point we have our 
reduced result. For a two-dimensional reduction, the frag- 
ment program reads four elements from four quadrants of 
the input texture, and the output size is halved in both di- 
mensions at each step. Buck et al. describe GPU reductions 
in more detail in the context of the Brook programming lan- 
guage [BFH*04]. 
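
The one-dimensional pass structure can be sketched on the CPU as follows. This is a model of the idea rather than GPU code: the two lists play the role of the pair of textures, the function name `reduce_passes` is ours, and we assume the stream length is a power of two.

```python
def reduce_passes(stream, op):
    """Multi-pass reduction; returns (result, number_of_passes)."""
    n = len(stream)
    if n == 0 or n & (n - 1):
        raise ValueError("stream length must be a power of two")
    src = list(stream)
    passes = 0
    while len(src) > 1:
        half = len(src) // 2
        # One rendering pass: output element i reads one value from each
        # half of the previous result and combines them with `op`.
        dst = [op(src[i], src[i + half]) for i in range(half)]
        src = dst  # swap the roles of the two buffers
        passes += 1
    return src[0], passes

values = [3, 1, 4, 1, 5, 9, 2, 6]
total, n_passes = reduce_passes(values, lambda a, b: a + b)
largest, _ = reduce_passes(values, max)
print(total, largest, n_passes)  # 31 9 3
```

Eight elements take three passes, matching the log2(n) pass count.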

4.1.3. Scatter and Gather 

Two fundamental memory operations with which most pro- 
grammers are familiar are write and read. If the write and 
read operations access memory indirectly, they are called 
scatter and gather respectively. A scatter operation looks like 
the C code d[a] = v where the value v is being stored 
into the data array d at address a. A gather operation is just 
the opposite of the scatter operation. The C code for gather 
looks like v = d[a]. 

The GPU implementation of gather is essentially a depen- 
dent texture fetch operation. A texture fetch from texture d 
with computed texture coordinates a performs the indirect 
memory read that defines gather. Unfortunately, scatter is not 
as straightforward to implement. Fragments have an implicit 
destination address associated with them: their location in 
framebuffer memory. A scatter operation would require that 
a program change the framebuffer write location of a given 
fragment, or would require a dependent texture write operation. 
Since neither of these mechanisms exists on today's 
GPUs, GPGPU programmers must resort to various tricks to 
achieve a scatter. These tricks include rewriting the problem 
in terms of gather; tagging data with final addresses during 
a traditional rendering pass and then sorting the data by ad- 
dress to achieve an effective scatter; and using the vertex 
processor to scatter (since vertex processing is inherently a 
scattering operation). Buck has described these mechanisms 
for changing scatter to gather in greater detail [Buc05]. 
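
To make the second trick concrete, here is a minimal CPU sketch of converting a scatter into a sort followed by gathers. It is our illustration, not code from [Buc05]: the function names are hypothetical, destination addresses are assumed unique, and Python's `sorted` stands in for a GPU sort.

```python
from bisect import bisect_left

def gather(d, addresses):
    """v[i] = d[a[i]]: an indirect read, i.e. a dependent texture fetch."""
    return [d[a] for a in addresses]

def scatter_reference(size, addresses, values, default=0):
    """d[a[i]] = v[i]: the indirect write that fragment programs cannot do."""
    d = [default] * size
    for a, v in zip(addresses, values):
        d[a] = v
    return d

def scatter_via_gather(size, addresses, values, default=0):
    """Emulate scatter using only a sort and per-output gathers."""
    # Step 1: tag every value with its final address and sort by address.
    tagged = sorted(zip(addresses, values))
    keys = [a for a, _ in tagged]
    out = []
    # Step 2: each output location searches the sorted list for its own
    # address and, if present, gathers the associated value.
    for loc in range(size):
        i = bisect_left(keys, loc)
        found = i < len(keys) and keys[i] == loc
        out.append(tagged[i][1] if found else default)
    return out

addrs, vals = [3, 0, 5], [10, 20, 30]
print(scatter_via_gather(6, addrs, vals))  # [20, 0, 0, 10, 0, 30]
```

Every output location computes its own value from reads alone, so the second step fits the fragment processor's fixed write address.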

4.1.4. Stream Filtering 

Many algorithms require the ability to select a subset of ele- 
ments from a stream, and discard the rest. This stream filter- 
ing operation is essentially a nonuniform reduction. These 
operations cannot rely on standard reduction mechanisms, 
because the location and number of elements to be filtered 
is variable and not known a priori. Example algorithms that 
benefit from stream filtering include simple data partitioning 
(where the algorithm only needs to operate on stream ele- 
ments with positive keys and is free to discard negative keys) 
and collision detection (where only objects with intersecting 
bounding boxes need further computation). 

Horn has described a technique called stream com- 
paction [Hor05b] that implements stream filtering on the 
GPU. Using a combination of scan [HS86] and search, 
stream filtering can be achieved in O(log n) passes. 
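
The combination of scan and search can be sketched on the CPU as below. This is a simplified model, not Horn's implementation: the log-step scan is a Hillis-Steele-style formulation that we assume for illustration, `bisect_left` stands in for the per-output binary search, and all function names are ours.

```python
from bisect import bisect_left

def inclusive_scan(flags):
    """Running sum in about log2(n) passes; each pass is a pure gather."""
    src = list(flags)
    offset = 1
    while offset < len(src):
        src = [src[i] + (src[i - offset] if i >= offset else 0)
               for i in range(len(src))]
        offset *= 2
    return src

def compact(stream, keep):
    """Keep the elements for which keep(x) is true, preserving their order."""
    flags = [1 if keep(x) else 0 for x in stream]
    counts = inclusive_scan(flags)
    total = counts[-1] if counts else 0
    # Output j searches for the first input whose running count reaches
    # j + 1; that input is the (j + 1)-th surviving element.
    return [stream[bisect_left(counts, j + 1)] for j in range(total)]

keys = [3, -1, 4, -1, -5, 9, 2, -6]
print(compact(keys, lambda x: x > 0))  # [3, 4, 9, 2]
```

The search step turns what would naturally be a scatter (each survivor writing to its computed slot) into a gather performed by each output element.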

4.1.5. Sort 

A sort operation allows us to transform an unordered set of 
data into an ordered set of data. Sorting is a classic algorith- 
mic problem that has been solved by several different tech- 
niques on the CPU. Unfortunately, nearly all of the classic 
sorting methods are not applicable to a clean GPU imple- 
mentation. The main reason these algorithms are not GPU 
friendly? Classic sorting algorithms are data-dependent and 
generally require scatter operations. Recall from Section 2.4 
that data dependent operations are difficult to implement ef- 
ficiently, and we just saw in Section 4.1.3 that scatter is 
not implemented for fragment processors on today's GPU. 
To make efficient use of GPU resources, a GPU-based sort 
should be oblivious to the input data, and should not require 
scatter. 

Figure 4: A simple parallel bitonic merge sort of eight elements 
requires six passes. Elements at the head and tail of 
each arrow are compared, with larger elements moving to 
the head of the arrow. 

Most GPU-based sorting implementations [BP04, 
CND03, KSW04, KW05, PDC*03, Pur04] have been based 
on sorting networks. The main idea behind a sorting network 
is that a given network configuration will sort input data 
in a fixed number of steps, regardless of the input data. 
Additionally, all the nodes in the network have a fixed com- 
munication path. The fixed communication pattern means 
the problem can be stated in terms of gather rather than a 
scatter, and the fixed number of stages for a given input size 
means the sort can be implemented without data-dependent 
branching. This yields an efficient GPU-based sort, with an 
O(n log² n) complexity. 

Kipfer et al. and Purcell et al. implement a bitonic merge 
sort [Bat68] and Callele et al. use a periodic balanced sort- 
ing network [DPRS89]. The implementation details of each 
technique vary, but the high level strategy for each is the 
same. The data to be sorted is stored in texture memory. Each 
of the fixed number of stages for the sort is implemented as 
a fragment program that does a compare-and-swap opera- 
tion. The fragment program simply fetches two texture val- 
ues, and based on the sort parameters, determines which of 
them to write out for the next pass. Figure 4 shows a simple 
bitonic merge sort. 
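
The pass structure of a bitonic merge sort can be modeled on the CPU as follows. This is a generic textbook formulation written for illustration, not the code of any of the cited implementations; the name `bitonic_sort` is ours and the input length is assumed to be a power of two. Each pass is expressed as a gather: every output element reads itself and one fixed partner, then keeps either the smaller or the larger value.

```python
def bitonic_sort(data):
    """Return (sorted_list, number_of_passes) using a bitonic sorting network."""
    n = len(data)
    if n == 0 or n & (n - 1):
        raise ValueError("input length must be a power of two")
    src = list(data)
    passes = 0
    k = 2
    while k <= n:          # size of the bitonic sequences being merged
        j = k // 2
        while j >= 1:      # distance between compared elements
            dst = []
            for i in range(n):  # one 'fragment' per element
                partner = i ^ j
                ascending = (i & k) == 0
                lo = min(src[i], src[partner])
                hi = max(src[i], src[partner])
                # The lower-indexed element of a pair keeps the minimum when
                # its block sorts ascending; the other keeps the maximum.
                keep_min = (i < partner) == ascending
                dst.append(lo if keep_min else hi)
            src = dst
            passes += 1
            j //= 2
        k *= 2
    return src, passes

result, n_passes = bitonic_sort([5, 7, 1, 8, 3, 2, 6, 4])
print(result, n_passes)  # [1, 2, 3, 4, 5, 6, 7, 8] 6
```

The partner index and the comparison direction depend only on the element's position and the pass number, never on the data, which is why the network needs neither scatter nor data-dependent branching.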

Sorting networks can also be implemented efficiently using 
the texture mapping and blending functionalities of the 
GPU [GRM05]. In each step of the sorting network, a comparator 
mapping is created at each pixel on the screen and 
the color of the pixel is compared against exactly one other 
pixel. The comparison operations are implemented using the 
blending functionality and the comparator mapping is implemented 
using the texture mapping hardware, thus entirely 
eliminating the need for fragment programs. Govindaraju 
et al. [GRH*05] have also analyzed the cache-efficiency 
of sorting network algorithms and presented an improved 
bitonic sorting network algorithm with a better data access 
pattern and data layout. The precision of the underlying sorting 
algorithm using comparisons with fixed function blending 
hardware is limited to the precision of the blending hardware, 
which on current hardware means IEEE 16-bit floating 
point values. Alternatively, the limitation to IEEE 16-bit 
values on current GPUs can be alleviated by using a single-line 
fragment program for evaluating the conditionals, but 
the fragment program implementation on current GPUs is 
nearly 1.2 times slower than the fixed function pipeline. 
Figure 5 highlights the performance of different GPU-based and 
CPU-based sorting algorithms on different sequences composed 
of IEEE 16-bit floating point values using a PC with a 
3.4 GHz Pentium 4 CPU and an NVIDIA GeForce 6800 Ultra 
GPU. A sorting library implementing the algorithm for 
16-bit and 32-bit floats is freely available for noncommercial 
use [GPU05]. 

Figure 5: Performance of CPU-based and GPU-based sorting 
algorithms on IEEE 16-bit floating point values. The 
CPU-based Qsort available in the Intel compiler is optimized 
using hyperthreading and SSE instructions. We observe 
that the cache-efficient GPU-based sorting network 
algorithm is nearly 6 times faster than the optimized CPU 
implementation on a 3.4 GHz PC with an NVIDIA GeForce 
6800 Ultra GPU. Furthermore, the fixed-function pipeline 
implementation described by Govindaraju et al. [GRH*05] 
is nearly 1.2 times faster than their implementation with 
fragment programs. 

GPUs have also been used to efficiently perform 1-D and 
3-D adaptive sorting of sequences [GHLM05]. Unlike sorting 
network algorithms, the computational complexity of 
adaptive sorting algorithms depends on the extent of disorder 
in the input sequence, and these algorithms work well for 
nearly-sorted sequences. The extent of disorder is computed 
using Knuth's measure of disorder. Given an input sequence I, 
the measure of disorder is defined as the minimal number of 
elements that need to be removed for the rest of the sequence 
to remain sorted. The algorithm proceeds in multiple iterations. 
In each iteration, the unsorted sequence is scanned twice. 
In the first pass, the sequence is scanned from the last 
element to the first, and an increasing sequence of elements 
M is constructed by comparing each element with the current 
minimum. In the second pass, the sorted elements in 
the increasing sequence are computed by comparing each 
element in M against the current minimum in I - M. The 
overall algorithm is simple and requires only comparisons 
against the minimum of a set of values. The algorithm is 
therefore useful for fast 3D visibility ordering of elements 
where the minimum comparisons are implemented using the 
depth buffer [GHLM05]. 

4.1.6. Search 

The last stream operation we discuss, search, allows us to 
find a particular element within a stream. Search can also 
be used to find the set of nearest neighbors to a specified 
element. Nearest neighbor search is used extensively when 
computing radiance estimates in photon mapping (see Sec- 
tion 5.4.2) and in database queries (e.g. find the 10 nearest 
restaurants to point X). When searching, we will use the par- 
allelism of the GPU not to decrease the latency of a single 
search, but rather to increase search throughput by executing 
multiple searches in parallel. 

Binary Search The simplest form of search is the binary 
search. This is a basic algorithm, where an element is located 
in a sorted list in O(log n) time. Binary search works 
by comparing the center element of a list with the element 
being searched for. Depending on the result of the compar- 
ison, the search then recursively examines the left or right 
half of the list until the element is found, or is determined 
not to exist. 

The GPU implementation of binary search [Hor05b, 
PDC*03, Pur04] is a straightforward mapping of the standard 
CPU algorithm to the GPU. Binary search is inherently 
serial, so we cannot parallelize lookup of a single element. 
That means only a single pixel's worth of work is 
done for a binary search. We can easily perform multiple bi- 
nary searches on the same data in parallel by sending more 
fragments through the search program. 
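
The throughput-oriented formulation can be sketched as follows. This is a CPU model under our own assumptions: each call to the hypothetical `search_kernel` stands for one fragment, and the loop runs a fixed number of iterations (enough for the whole list) so that every fragment executes the same instruction sequence regardless of its key.

```python
def search_kernel(sorted_data, key, steps):
    """One 'fragment': serial binary search with a fixed iteration count."""
    lo, hi = 0, len(sorted_data)  # current search interval is [lo, hi)
    for _ in range(steps):
        if lo < hi:
            mid = (lo + hi) // 2
            if sorted_data[mid] < key:
                lo = mid + 1
            else:
                hi = mid
    found = lo < len(sorted_data) and sorted_data[lo] == key
    return lo if found else -1

def parallel_search(sorted_data, keys):
    """Run one independent search per key; returns indices, or -1 if absent."""
    steps = len(sorted_data).bit_length()  # interval halves on every step
    return [search_kernel(sorted_data, k, steps) for k in keys]

table = [1, 3, 5, 7, 9, 11]
print(parallel_search(table, [7, 1, 4, 11]))  # [3, 0, -1, 5]
```

Each individual search is no faster than on the CPU; the gain comes from running many of them at once over the same sorted data.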

Nearest Neighbor Search Nearest neighbor search is a 
slightly more complicated form of search. In this search, 
we want to find the k nearest neighbors to a given element. 
On the CPU, this has traditionally been done using a k-d 
tree [Ben75]. During a nearest neighbor search, candidate 
elements are maintained in a priority queue, ordered by dis- 
tance from the "seed" element. At the end of the search, the 
queue contains the nearest neighbors to the seed element. 
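
The role of the priority queue can be seen in a small CPU sketch. For brevity it visits candidates with a linear scan instead of a k-d tree traversal, which is our simplification; the function name `k_nearest` is hypothetical and the queue is Python's `heapq` used as a max-heap on squared distance.

```python
import heapq

def k_nearest(points, seed, k):
    """Return the k points closest to `seed`, nearest first."""
    heap = []  # entries are (-squared_distance, point); farthest is heap[0]
    for p in points:
        d = sum((a - b) ** 2 for a, b in zip(p, seed))
        if len(heap) < k:
            heapq.heappush(heap, (-d, p))
        elif d < -heap[0][0]:
            # A closer neighbor was found: evict the farthest candidate.
            heapq.heapreplace(heap, (-d, p))
    return [p for _, p in sorted(heap, reverse=True)]

pts = [(0, 0), (5, 5), (1, 1), (2, 2), (9, 9)]
print(k_nearest(pts, (0, 0), 2))  # [(0, 0), (1, 1)]
```

The eviction step is the part that is hard to express in a fragment program, as discussed next.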

Unfortunately, the GPU implementation of nearest neigh- 
bor search is not as straightforward. We can search a k-d tree 
data structure [FS05], but we have not yet found a way to 
efficiently maintain a priority queue. The important detail 
about the priority queue is that candidate neighbors can be 
removed from the queue if closer neighbors are found. Purcell 
et al. propose a data structure for finding nearest neighbors 
called the kNN-grid [PDC*03]. The grid approximates 
a nearest-neighbor search, but is unable to reject candidate 
neighbors once they are added to the list. The quality of the 
search then depends on the density of the grid and the order 
candidate neighbors are visited during the search. The details 
of the kNN-grid implementation are beyond the scope of this 
paper, and readers are encouraged to review the original papers 
for more details [PDC*03, Pur04]. The next section of 
this report discusses GPGPU data structures like arrays and 
the kNN-grid. 

4.2. Data Structures 

Every GPGPU algorithm must operate on data stored in an 
appropriate structure. This section describes the data struc- 
tures used thus far for GPU computation. Effective GPGPU 
data structures must support fast and coherent parallel ac- 
cesses as well as efficient parallel iteration, and must also 
work within the constraints of the GPU memory model. We 
first describe this model and explain common patterns seen 
in many GPGPU structures, then present data structures un- 
der three broad categories: dense arrays, sparse arrays, and 
adaptive arrays. Lefohn et al. [LKO05, LKS*05] give a more 
detailed overview of GPGPU data structures and the GPU 
memory model. 

The GPU Memory Model Before describing GPGPU data 
structures, we briefly describe the memory primitives with 
which they are built. As described in Section 2.3, GPU data 
are almost always stored in texture memory. To maintain par- 
allelism, operations on these textures are limited to read-only 
or write-only access within a kernel. Write access is further 
limited by the lack of scatter support (see Section 4.1.3). 
Outside of kernels, users may allocate or delete textures, 
copy data between the CPU and GPU, copy data between 
GPU textures, or bind textures for kernel access. Lastly, most 
GPGPU data structures are built using 2D textures for two 
reasons. First, the maximum 1D texture size is often too 
small to be useful; second, current GPUs cannot efficiently 
write to a slice of a 3D texture. 

Iteration In modern C/C++ programming, algorithms are 
defined in terms of iteration over the elements of a data 
structure. The stream programming model described in Section 
2.3 performs an implicit data-parallel iteration over a 
stream. Iteration over a dense set of elements is usually ac- 
complished by drawing a single large quad. This is the com- 
putation model supported by Brook, Sh, and Scout. Complex 
structures, however, such as sparse arrays, adaptive arrays, 
and grid-of-list structures often require more complex iter- 
ation constructs [BFGS03, KW03, LKHW04]. These range 
iterators are usually defined using numerous smaller quads, 
lines, or point sprites. 

Generalized Arrays via Address Translation The majority 
of data structures used thus far in GPGPU programming 
are random-access multidimensional containers, including 
dense arrays, sparse arrays, and adaptive arrays. Lefohn 
et al. [LKS*05] show that these virtualized grid structures 
share a common design pattern. Each structure defines a vir- 
tual grid domain (the problem space), a physical grid domain 
(usually a 2D texture), and an address translator between the 
two domains. A simple example is a 1D array represented 
with a 2D texture. In this case, the virtual domain is 1D, the 
physical domain is 2D, and the address translator converts 
between them [LKO05, PBMH02]. 

In order to provide programmers with the abstraction 
of iterating over elements in the virtual domain, GPGPU 
data structures must support both virtual-to-physical and 
physical-to-virtual address translation. For example, in the 
1D array example above, an algorithm reads from the 1D 
array using a virtual-to-physical (1D-to-2D) translation. An 
algorithm that writes to the array, however, must convert 
the 2D pixel (physical) position of each stream element to 
a 1D virtual address before performing computations on 
1D addresses. A number of authors describe optimization 
techniques for pre-computing these address translation op- 
erations before the fragment processor [BFGS03, CHL04, 
KW03,LKHW04]. These optimizations pre-compute the ad- 
dress translation using the CPU, the vertex processor, and/or 
the rasterizer. 
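
For the 1D-array example, the two translations reduce to integer arithmetic. The sketch below assumes a row-major layout in a texture of a given width in texels; the function names are ours and the layout is one common choice, not the only possible one.

```python
def virtual_to_physical(i, width):
    """1D virtual address -> 2D texel position (x, y), row-major layout."""
    return (i % width, i // width)

def physical_to_virtual(x, y, width):
    """2D texel position -> 1D virtual address (inverse of the above)."""
    return y * width + x

WIDTH = 4  # hypothetical texture width in texels
print(virtual_to_physical(6, WIDTH))     # (2, 1)
print(physical_to_virtual(2, 1, WIDTH))  # 6
```

A kernel that reads element i applies the first function; a kernel that needs to know which element it is producing applies the second to its own pixel position.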

The Brook programming systems provide virtualized in- 
terfaces to most GPU memory operations for contiguous, 
multidimensional arrays. Sh provides a subset of the operations 
for large 1D arrays. The Glift template library 
provides virtualized interfaces to GPU memory opera- 
tions for any structure that can be defined using the pro- 
grammable address translation paradigm. These systems 
also define iteration constructs over their respective data 
structures [BFH*04,LKS*05,MTP*04]. 



4.2.1. Dense Arrays 

The most common GPGPU data structure is a contiguous 
multidimensional array. These arrays are often implemented 
by first mapping from N-D to 1D, then from 1D 
to 2D [BFH*04, PBMH02]. For 3D-to-2D mappings, Harris 
et al. describe an alternate representation, flat 3D textures, 
that directly maps the 2D slices of the 3D array to 2D mem- 
ory [HBSL03]. Figures 6 and 7 show diagrams of these ap- 
proaches. 

Iteration over dense arrays is performed by drawing large 
quads that span the range of elements requiring computa- 
tion. Brook, Glift, and Sh provide users with fully virtualized 
CPU/GPU interfaces to these structures. Lefohn et al. give 
code examples for optimized implementations [LKO05]. 

Figure 6: GPU-based multidimensional arrays usually store 
data in 2D texture memory. Address translators for N-D arrays 
generally convert N-D addresses to 1D, then to 2D. 




Figure 7: For the special case of 3D-to-2D conversions or 
flat 3D textures, 2D slices of the 3D array are packed into a 
single 2D texture. This structure maintains 2D locality and 
therefore supports native bilinear filtering. 
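
The flat 3D texture mapping can be written as a direct 3D-to-2D translation. The sketch assumes that slices are tiled left to right, then top to bottom, with a fixed number of tiles per row; this tiling order and the function names are our assumptions for illustration.

```python
def flat3d_to_2d(x, y, z, width, height, tiles_per_row):
    """Map voxel (x, y, z) to a texel (u, v) in the packed 2D texture."""
    tile_x = z % tiles_per_row   # column of the tile holding slice z
    tile_y = z // tiles_per_row  # row of the tile holding slice z
    return (tile_x * width + x, tile_y * height + y)

def flat2d_to_3d(u, v, width, height, tiles_per_row):
    """Inverse mapping: texel (u, v) back to voxel (x, y, z)."""
    tile_x, x = divmod(u, width)
    tile_y, y = divmod(v, height)
    return (x, y, tile_y * tiles_per_row + tile_x)

# A 4 x 4 x 4 volume stored as a 2 x 2 grid of 4 x 4 slices.
print(flat3d_to_2d(1, 2, 3, 4, 4, 2))  # (5, 6)
print(flat2d_to_3d(5, 6, 4, 4, 2))     # (1, 2, 3)
```

Neighbors within a slice stay adjacent in the 2D texture, which is the 2D locality the Figure 7 caption refers to; neighbors in z land in a different tile.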



4.2.2. Sparse Arrays 

Sparse arrays are multidimensional structures that store only 
a subset of the grid elements defined by their virtual domain. 
Example uses include sparse matrices and implicit surface 
representations. 

Static Sparse Arrays We define static to mean that the 
number and position of stored (non-zero) elements does not 
change throughout GPU computation, although the GPU 
computation may update the value of the stored elements. A 
common application of static sparse arrays is sparse matri- 
ces. These structures can use complex, pre-computed pack- 
ing schemes to represent the active elements because the 
structure does not change. 

Sparse matrix structures were first presented by Bolz et 
al. [BFGS03] and Krüger et al. [KW03]. Bolz et al. treat 
each row of a sparse matrix as a separate stream and pack the 
rows into a single texture. They simultaneously iterate over 
all rows containing the same number of non-zero elements 
by drawing a separate small quad for each row. They perform 
the physical-to-virtual and virtual-to-physical address translations 
in the fragment stage using a two-level lookup table. 
In contrast, for random sparse matrices, Krüger et al. pack 
all active elements into vertex buffers and iterate over the 
structure by drawing a single-pixel point for each element. 
Each point contains a pre-computed virtual address. Krüger 
et al. also describe a packed texture format for banded sparse 
matrices. Buck et al. later introduced a sparse matrix Brook 
example application that performs address translation with 
only a single level of indirection. The scheme packs the 
non-zero elements of each row into identically sized streams. 
As such, the approach applies to sparse matrices where all 
rows contain approximately the same number of non-zero 
elements. See Section 4.4 for more detail about GPGPU linear 
algebra. 

Figure 8: Page table address data structures can be used to 
represent dynamic sparse or adaptive GPGPU data structures. 
For sparse arrays, page tables map only a subset of 
possible pages to texture memory. Page-table-based adaptive 
arrays map either uniformly sized physical pages to 
a varying number of virtual pages or vice versa. Page tables 
consume more memory than a tree structure but offer 
constant-time memory accesses and support efficient 
data-parallel insertion and deletion of pages. Example applications 
include ray tracing acceleration structures, adaptive 
shadow maps, and deformable implicit surfaces [LKHW04, 
LSK*05, PBMH02]. Lefohn et al. describe these structures 
in detail [LKS*05]. (Diagram panels: Virtual Domain, Page 
Table, Physical Memory.) 
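
A packed-row layout with a single level of indirection can be sketched as follows. This is our CPU illustration in the spirit of the scheme described above, not the Brook example itself: rows are padded with explicit zeros to a common length (an assumption on our part), and the function names are hypothetical. The only indirect access is the gather `x[c]` through the stored column index.

```python
def pack_rows(dense, row_len):
    """Pack each row's non-zeros into streams of identical length."""
    values, columns = [], []
    for row in dense:
        nz = [(v, c) for c, v in enumerate(row) if v != 0]
        if len(nz) > row_len:
            raise ValueError("row has more non-zeros than the stream length")
        # Padding entries multiply x[0] by zero, so they do not change sums.
        nz += [(0.0, 0)] * (row_len - len(nz))
        values.append([v for v, _ in nz])
        columns.append([c for _, c in nz])
    return values, columns

def spmv(values, columns, x):
    """y = A x, using one gather from x per stored element."""
    return [sum(v * x[c] for v, c in zip(vals, cols))
            for vals, cols in zip(values, columns)]

A = [[2, 0, 0],
     [0, 0, 3],
     [1, 4, 0]]
vals, cols = pack_rows(A, 2)
print(spmv(vals, cols, [1, 2, 3]))  # [2.0, 9.0, 9]
```

The padding cost explains why the approach suits matrices whose rows have roughly equal numbers of non-zeros: a single long row forces every stream to that length.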

Dynamic Sparse Arrays Dynamic sparse arrays are similar 
to those described in the previous section but support inser- 
tion and deletion of non-zero elements during GPU compu- 
tation. An example application for a dynamic sparse array is 
the data structure for a deforming implicit surface. 

Multidimensional page table address translators are an at- 
tractive option for dynamic sparse (and adaptive) arrays be- 
cause they provide fast data access and can be easily up- 
dated. Like the page tables used in modern CPU architec- 
tures and operating systems, page table data structures en- 
able sparse mappings by mapping only a subset of possible 
pages into physical memory. Page table address translators 
support constant access time and storage proportional to the 
number of elements in the virtual address space. The transla- 
tions always require the same number of instructions and are 
therefore compatible with the current fragment processor's 
SIMD architecture. Figure 8 shows a diagram of a sparse 2D 
page table structure. 
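
A sparse 2D page table can be modeled on the CPU as below. The class and method names are ours, pages are square and uniformly sized, and allocation simply appends a page; a real GPU implementation would store both the table and the pages in textures. The sketch is meant to show that translation is the same fixed sequence of arithmetic and lookups for every address, mapped or not.

```python
class PagedSparseArray:
    """Sparse 2D array backed by a one-level page table (CPU sketch)."""

    def __init__(self, virt_w, virt_h, page):
        self.page = page
        pages_x = -(-virt_w // page)  # ceiling division
        pages_y = -(-virt_h // page)
        # One table entry per virtual page; -1 marks an unmapped page.
        self.table = [[-1] * pages_x for _ in range(pages_y)]
        self.physical = []  # list of allocated page x page blocks

    def _translate(self, x, y):
        """Virtual (x, y) -> (physical page index, offset x, offset y)."""
        entry = self.table[y // self.page][x // self.page]
        return entry, x % self.page, y % self.page

    def read(self, x, y, default=0.0):
        p, ox, oy = self._translate(x, y)
        return default if p < 0 else self.physical[p][oy][ox]

    def write(self, x, y, value):
        p, ox, oy = self._translate(x, y)
        if p < 0:  # insert a new page on first write
            p = len(self.physical)
            self.physical.append([[0.0] * self.page for _ in range(self.page)])
            self.table[y // self.page][x // self.page] = p
        self.physical[p][oy][ox] = value

grid = PagedSparseArray(8, 8, 4)
grid.write(5, 1, 7.0)
grid.write(6, 2, 1.0)   # same virtual page, no new allocation
grid.write(0, 7, 2.0)   # different page, second allocation
print(grid.read(5, 1), grid.read(0, 0), len(grid.physical))  # 7.0 0.0 2
```

The table itself has one entry per virtual page, while physical storage grows only with the number of pages actually written.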

Lefohn et al. represent a sparse dynamic volume using a 
CPU-based 3D page table with uniformly-sized 2D physical 
pages [LKHW04]. They store the page table on the CPU and 
the physical data on the GPU, and pre-compute all address 
translations using the CPU, vertex processor, and rasterizer. 
The GPU creates page allocation and deletion requests by 
rendering a small bit vector message. The CPU decodes this 
message and performs the requested memory management 
operations. Strzodka et al. use a page discretization and sim- 
ilar message-passing mechanism to define sparse iteration 
over a dense array [ST04]. 

4.2.3. Adaptive Structures 

Adaptive arrays are a generalization of sparse arrays and rep- 
resent structures such as quadtrees, octrees, kNN-grids, and 
k-d trees. These structures non-uniformly map data to the 
virtual domain and are useful for very sparse or multiresolu- 
tion data. Similar to their CPU counterparts, GPGPU adap- 
tive address translators are represented with a tree, a page 
table, or a hash table. Example applications include ray trac- 
ing acceleration structures, photon maps, adaptive shadow 
maps, and octree textures. 

Static Adaptive Structures 

Purcell et al. use a static adaptive array to represent a 
uniform-grid ray tracing acceleration structure [PBMH02]. 
The structure uses a one-level, 3D page table address trans- 
lator with varying-size physical pages. A CPU-based pre- 
process packs data into the varying-size pages and stores the 
page size and page origin in the 3D page table. The ray tracer 
advances rays through the page table using a 3D line draw- 
ing algorithm. Rays traverse the variable-length triangle lists 
one render pass at a time. The conditional execution tech- 
niques described in Section 2.4 are used to avoid performing 
computation on rays that have reached the end of the triangle 
list. 

Foley et al. recently introduced the first k-d tree for GPU 
ray tracing [FS05]. A k-d tree adaptively subdivides space 
into axis-aligned bounding boxes whose size and position 
are determined by the data rather than a fixed grid. Like the 
uniform grid structure, the query input for their structure is 
the ray origin and direction and the result is the origin and 
size of a triangle list. In their implementation, a CPU-based 
pre-process creates the k-d tree address translator and packs 
the triangle lists into texture memory. They present two new 
k-d tree traversal algorithms that are GPU-compatible and, 
unlike previous algorithms, do not require the use of a stack. 

Dynamic Adaptive Arrays Purcell et al. introduced the 
first dynamic adaptive GPU array, the kNN-grid photon 
map [PDC*03]. The structure uses a one-level page table 
with either variable-sized or fixed-sized pages. They update 
the variable-page-size version by sorting data elements and 
searching for the beginning of each page. The fixed-page- 
size variant limits the number of data elements per page but 
avoids the costly sorting and searching steps. 
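
The sort-and-search update of the variable-page-size variant can be sketched as follows. This is a simplified CPU model that assumes 1D, non-negative point coordinates. The function name `build_pages` and its parameters are hypothetical and do not come from [PDC*03].

```python
from bisect import bisect_left


def build_pages(points, cell_size):
    """Sort points by grid cell, then find where each cell's page begins."""
    # Step 1: sort data elements by the grid cell (page) they fall into.
    keyed = sorted((int(p // cell_size), p) for p in points)
    cell_ids = [c for c, _ in keyed]
    data = [p for _, p in keyed]
    # Step 2: binary-search the sorted keys for the start of each page.
    page_start = {}
    for c in sorted(set(cell_ids)):
        page_start[c] = bisect_left(cell_ids, c)
    return data, page_start
```

The fixed-page-size variant would instead reserve a constant number of slots per cell, trading capacity for the cost of the sort and search.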

Lefohn et al. use a mipmap hierarchy of page tables
to define quadtree-like and octree-like dynamic structures
[LKS*05,LSK*05]. They apply the structures to GPU-based
adaptive shadow mapping and dynamic octree textures.
The structure achieves adaptivity by mapping a varying
number of virtual pages to uniformly sized physical
pages. The page tables consume more memory than a tree-based
approach but support constant-time accesses and can
be efficiently updated by the GPU. The structures support
data-parallel iteration over the active elements by drawing
a point sprite for each mapped page and using the vertex
processor and rasterizer to pre-compute physical-to-virtual
address translations.

Figure 9: Tree-based address translators can be used in
place of page tables to represent adaptive data structures
such as quadtrees, octrees and k-d trees [FS05, LHN05].
Trees consume less memory than page table structures but
result in longer access times and are more costly to
incrementally update.

In the limit, multilevel page tables are synonymous with 
N-tree structures. Coombe et al. and Lefebvre et al. de- 
scribe dynamic tree-based structures [CHL04,LHN05]. Tree 
address translators consume less memory than a page table
($O(\log N)$), but result in slower access times ($O(\log N)$)
and require non-uniform (non-SIMD) computation. Coombe
et al. use a CPU-based quadtree translator [CHL04] while 
Lefebvre et al. describe a GPU-based octree-like transla- 
tor [LHN05]. Figure 9 depicts a tree-based address trans- 
lator. 

4.2.4. Non-Indexable Structures 

All the structures discussed thus far support random ac- 
cess and therefore trivially support data-parallel accesses. 
Nonetheless, researchers are beginning to explore non- 
indexable structures. Ernst et al. and Lefohn et al. both de- 
scribe GPU-based stacks [EVG04,LKS*05]. 

Efficient dynamic parallel data structures are an active 
area of research. For example, structures such as priority 
queues (see Section 4.1.6), sets, linked lists, and hash ta- 
bles have not yet been demonstrated on GPUs. While sev- 
eral dynamic adaptive tree-like structures have been imple- 
mented, many open problems remain in efficiently building 
and modifying these structures, and many structures (e.g., 
k-d trees) have not yet been constructed on the GPU. Continued
research in understanding the generic components of
GPU data structures may also lead to the specification of
generic algorithms, such as those described in Section 4.1.

4.3. Differential Equations

Differential equations arise in many disciplines of science 
and engineering. Their efficient solution is necessary for ev- 
erything from simulating physics for games to detecting fea- 
tures in medical imaging. Typically differential equations 
are solved for entire arrays of input. For example, physi- 
cally based simulations of heat transfer or fluid flow typi- 
cally solve a system of equations representing the temper- 
ature or velocity sampled over a spatial domain. This sam- 
pling means that there is high data parallelism in these prob- 
lems, which makes them suitable for GPU implementation. 




Figure 10: Solving the wave equation PDE on the GPU al- 
lows for fast and stable rendering of water surfaces. (Image 
generated by Krüger et al. [KW03])



There are two main classes of differential equations:
ordinary differential equations (ODEs) and partial differential
equations (PDEs). An ODE is an equality involving a
function and its derivatives. An ODE of order $n$ is an
equation of the form $F(x, y, y', \ldots, y^{(n)}) = 0$, where $y^{(n)}$ is the
$n$th derivative with respect to $x$. PDEs, on the other hand,
are equations involving functions and their partial derivatives,
like the wave equation $\frac{\partial^2 \psi}{\partial x^2} + \frac{\partial^2 \psi}{\partial y^2} + \frac{\partial^2 \psi}{\partial z^2} = \frac{1}{v^2} \frac{\partial^2 \psi}{\partial t^2}$ (see
Figure 10). ODEs typically arise in the simulation of the
motion of objects, and this is where GPUs have been ap- 
plied to their solution. Particle system simulation involves 
moving many point particles according to local and global 
forces. This results in simple ODEs that can be solved via 
explicit integration (most have used the well-known Euler, 
Midpoint, or Runge-Kutta methods). This is relatively sim- 
ple to implement on the GPU: a simple fragment program is 
used to update each particle's position and velocity, which 
are stored as 3D vectors in textures. Kipfer et al. presented a 
method for simulating particle systems on the GPU includ- 
ing inter-particle collisions by using the GPU to quickly sort 
the particles to determine potential colliding pairs [KSW04]. 
In simultaneous work, Kolb et al. produced a GPU particle 
system simulator that supported accurate collisions of parti- 
cles with scene geometry by using GPU depth comparisons 
to detect penetration [KLRS04]. Krüger et al. presented a
scientific flow exploration system that supports a wide variety
of visualization geometries computed entirely on the
GPU [KKKW05] (see Figure 11). A simple GPU particle
system example is provided in the NVIDIA SDK [Gre04].
Nyland et al. extended this example to add N-body gravitational
force computation [NHP04]. Related to particle systems
is cloth simulation. Green demonstrated a very simple
GPU cloth simulation using Verlet integration [Ver67] with 
basic orthogonal grid constraints [Gre03]. Zeller extended 
this with shear constraints which can be interactively bro- 
ken by the user to simulate cutting of the cloth into multiple 
pieces [Zel05]. 
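
The explicit integration of particle ODEs described above can be sketched on the CPU as follows. This is a minimal explicit Euler step, assuming 3D tuples for position and velocity. The function `euler_step` and the `force` callback are hypothetical names chosen for illustration. On a GPU, the loop body would correspond to one fragment program invocation per particle, with positions and velocities read from and written to textures.

```python
def euler_step(positions, velocities, force, mass, dt):
    """Advance every particle by one explicit Euler step.

    positions, velocities: lists of 3-tuples.
    force: callable (position, velocity) -> 3-tuple force on one particle.
    """
    new_positions, new_velocities = [], []
    for p, v in zip(positions, velocities):
        a = tuple(f / mass for f in force(p, v))            # a = F / m
        # x(t + dt) = x(t) + v(t) dt
        new_positions.append(tuple(pi + vi * dt for pi, vi in zip(p, v)))
        # v(t + dt) = v(t) + a(t) dt
        new_velocities.append(tuple(vi + ai * dt for vi, ai in zip(v, a)))
    return new_positions, new_velocities
```

Because each particle is updated independently from the previous state, the step is data-parallel. Midpoint or Runge-Kutta schemes follow the same pattern with additional force evaluations.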




Figure 11: GPU-computed stream ribbons in a 3D flow 
field. The entire process from vectorfield interpolation and 
integration to curl computation, and finally geometry gener- 
ation and rendering of the stream ribbons, is performed on 
the GPU [KKKW05]. 

When solving PDEs, the two common methods of sam- 
pling the domain of the problem are finite differences and 
finite element methods (FEM). The former has been much 
more common in GPU applications due to the natural map- 
ping of regular grids to the texture sampling hardware 
of GPUs. Most of this work has focused on solving the 
pressure-Poisson equation that arises in the discrete form of 
the Navier-Stokes equations for incompressible fluid flow. 
Among the numerical methods used to solve these systems
are the conjugate gradient method [GV96] (Bolz et
al. [BFGS03] and Krüger and Westermann [KW03]), the
multigrid method [BHM00] (Bolz et al. [BFGS03] and
Goodnight et al. [GWL*03]), and simple Jacobi and
red-black Gauss-Seidel iteration (Harris et al. [HBSL03]).
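
To make the simplest of these solvers concrete, the sketch below applies Jacobi iteration to a finite difference discretization of a 2D Poisson equation. It is a CPU illustration under assumptions of our own choosing (zero Dirichlet boundary values, uniform grid spacing `h`, a fixed iteration count), and `jacobi_poisson` is a hypothetical name. A GPU version would run the inner update as one fragment program pass per iteration over a grid texture.

```python
def jacobi_poisson(b, iterations, h=1.0):
    """Approximately solve laplacian(x) = b on a 2D grid, x = 0 on the boundary."""
    rows, cols = len(b), len(b[0])
    x = [[0.0] * cols for _ in range(rows)]
    for _ in range(iterations):
        new = [row[:] for row in x]          # write to a separate buffer
        for i in range(1, rows - 1):
            for j in range(1, cols - 1):
                # Five-point stencil solved for the center value.
                new[i][j] = (x[i - 1][j] + x[i + 1][j] +
                             x[i][j - 1] + x[i][j + 1] -
                             h * h * b[i][j]) / 4.0
        x = new                               # swap buffers ("ping-pong")
    return x
```

Each output cell depends only on the previous iterate, so all cells can be updated in parallel, which is why the method maps directly onto regular-grid textures.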

The earliest work on using GPUs to solve PDEs was 
done by Rumpf and Strzodka, who mapped mathemati- 
cal structures like matrices and vectors to textures and lin- 
ear algebra operations to GPU features such as blending 
and the OpenGL imaging subset. They applied the GPU 
to segmentation and non-linear diffusion in image processing
[RS01b, RS01a] and used GPUs to solve finite element
discretizations of PDEs like the anisotropic heat equation
[RS01c]. Recent work by Rumpf and Strzodka [RS05]
discusses the use of Finite Element schemes for PDE solvers
on GPUs in detail. Lefohn et al. applied GPUs to the solution
of sparse, non-linear PDEs (level-set equations) for volume 
segmentation [LW02,Lef03]. 

4.4. Linear Algebra 

As GPU flexibility has increased over the last decade, re- 
searchers were quick to realize that many linear algebraic 
problems map very well to the pipelined SIMD hardware in 
these processors. Furthermore, linear algebra techniques are 
of special interest for many real-time visual effects important 
in computer graphics. A particularly good example is fluid 
simulation (Section 5.2), for which the results of the numer- 
ical computation can be computed in and displayed directly 
from GPU memory. 

Larsen and McAllister described an early pre-floating- 
point implementation of matrix multiplies. Adopting a tech- 
nique from parallel computing that distributes the computa- 
tion over a logically cube-shaped lattice of processors, they 
used 2D textures and simple blending operations to perform 
the matrix product [LM01]. Thompson et al. proposed a gen- 
eral computation framework running on the GPU vertex pro- 
cessor; among other test cases they implemented some linear 
algebra operations and compared the timings to CPU imple- 
mentations. Their test showed that especially for large ma- 
trices a GPU implementation has the potential to outperform 
optimized CPU solutions [THO02]. 

With the availability of 32-bit IEEE floating point textures 
and more sophisticated shader functionality in 2003, Hilles- 
land et al. presented numerical solution techniques to least 
squares problems [HMG03]. Bolz et al. [BFGS03] presented 
a representation for matrices and vectors. They implemented 
a sparse matrix conjugate gradient solver and a regular- 
grid multigrid solver for GPUs, and demonstrated the effec- 
tiveness of their approach by using these solvers for mesh 
smoothing and solving the incompressible Navier-Stokes 
equations. Goodnight et al. presented another multigrid 
solver; their solution focused on an improved memory layout 
of the domain [GWL*03] that avoids the context-switching 
latency that arose with the use of OpenGL pbuffers. 

Other implementations avoided this pbuffer latency by 
using the DirectX API. Moravanszky [Mor02] proposed a 
GPU-based linear algebra system for the efficient representation
of dense matrices. Krüger and Westermann took
a broader approach and presented a general linear algebra
framework supporting basic operations on GPU-optimized
representations of vectors, dense matrices, and multiple
types of sparse matrices [KW03]. Their implementation was
based on a 2D texture representation for vectors in which a
vector is laid out into the RGBA components of a 2D texture.
A matrix was composed of such vector textures, either split
column-wise for dense matrices or diagonally for banded
sparse matrices. With this representation a component-wise
vector-vector operation (add, multiply, and so on) requires
rendering only one quad textured with the two input vector
textures with a short shader that does two fetches into
the input textures, combines the results (e.g. add or multi- 
ply), and outputs this to a new texture representing the result 
vector. A matrix-vector operation in turn is executed as mul- 
tiple vector-vector operations: the columns or diagonals are 
multiplied with the vector one at a time and are added to the 
result vector. In this way a five-banded matrix (for instance,
occurring in the Poisson equation of the Navier-Stokes fluid
simulation) can be multiplied with a vector by rendering
only five quads. The set of basic operations is completed by
the reduce operation, which computes single values out of
vectors, e.g., the sum of all vector elements (Section 4.1.2).
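
The diagonal-wise matrix-vector product and the reduce operation can be sketched as follows. This is a CPU model written for illustration, not the framework's actual code. The names `banded_matvec` and `reduce_sum` are hypothetical, and the matrix is assumed to be stored as a dictionary mapping each diagonal offset to a full-length vector.

```python
def banded_matvec(diagonals, x):
    """Multiply a banded matrix by x, one diagonal ("render pass") at a time.

    diagonals: dict mapping offset k to a list d with d[i] = A[i][i + k].
    """
    n = len(x)
    y = [0.0] * n
    for offset, diag in diagonals.items():
        # One component-wise multiply-add of the diagonal with a shifted x.
        for i in range(n):
            j = i + offset
            if 0 <= j < n:
                y[i] += diag[i] * x[j]
    return y


def reduce_sum(v):
    """Sum a vector by repeatedly folding it in half, as a GPU reduce would."""
    v = list(v)
    while len(v) > 1:
        if len(v) % 2:
            v.append(0.0)                 # pad to an even length
        half = len(v) // 2
        v = [v[i] + v[i + half] for i in range(half)]
    return v[0] if v else 0.0
```

A matrix with five diagonals therefore needs five passes of the outer loop, mirroring the five quads mentioned above.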

Using this set of operations, encapsulated into C++ 
classes, Krüger and Westermann enabled more complex algorithms
to be built without knowledge of the underlying
GPU implementation [KW03]. For example, a conjugate 
gradient solver was implemented with fewer than 20 lines 
of C++ code. This solver in turn can be used for the solution 
of PDEs such as the Navier-Stokes equations for fluid flow 
(see Figure 12). 
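
To show why such a solver stays short once the basic operations exist, here is a hedged Python sketch of the standard (unpreconditioned) conjugate gradient method. It is not the C++ code of [KW03]. The helpers `dot` and `axpy` stand in for the reduce and vector-vector operations, and `matvec` is any caller-supplied function that applies a symmetric positive definite matrix.

```python
def dot(a, b):
    """Inner product (a component-wise multiply followed by a reduce)."""
    return sum(x * y for x, y in zip(a, b))


def axpy(alpha, x, y):
    """Return alpha * x + y (a single vector-vector operation)."""
    return [alpha * xi + yi for xi, yi in zip(x, y)]


def conjugate_gradient(matvec, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for symmetric positive definite A given matvec(v) = A v."""
    x = [0.0] * len(b)
    r = list(b)                 # residual b - A x, with x = 0 initially
    p = list(r)                 # search direction
    rs_old = dot(r, r)
    for _ in range(max_iter):
        if rs_old <= tol * tol:
            break
        Ap = matvec(p)
        alpha = rs_old / dot(p, Ap)
        x = axpy(alpha, p, x)
        r = axpy(-alpha, Ap, r)
        rs_new = dot(r, r)
        p = axpy(rs_new / rs_old, p, r)
        rs_old = rs_new
    return x
```

The solver touches the matrix only through `matvec`, so the same code works with dense, banded, or other sparse representations.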




Figure 12: This image shows a 2D Navier-Stokes fluid 
flow simulation with arbitrary obstacles. It runs on a stag- 
gered 512 by 128 grid. Even with additional features like 
vorticity confinement enabled, such simulations perform at 
about 200 fps on current GPUs such as ATI's Radeon X800 
[KW03]. 



Apart from their applications in numerical simulation, 
linear algebra operators can be used for GPU perfor- 
mance evaluation and comparison to CPUs. For instance 
Brook [BFH*04] featured a spMatrixVec test that used a 
padded compressed sparse row format. 
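
One plausible reading of a padded row-compressed layout is sketched below: every row stores the same number of (value, column) slots, with unused slots filled by zeros. The exact layout used by the Brook test is not specified here, so the functions `pad_rows` and `sp_matvec` and the padding convention are assumptions made for illustration.

```python
def pad_rows(dense):
    """Pack each row's non-zeros into equal-length value and column arrays."""
    width = max(sum(1 for v in row if v != 0) for row in dense)
    values, columns = [], []
    for row in dense:
        nz = [(v, j) for j, v in enumerate(row) if v != 0]
        nz += [(0.0, 0)] * (width - len(nz))   # pad; zero value, dummy column
        values.append([v for v, _ in nz])
        columns.append([j for _, j in nz])
    return values, columns


def sp_matvec(values, columns, x):
    """Sparse matrix-vector product; every row does the same amount of work."""
    return [sum(v * x[j] for v, j in zip(vrow, crow))
            for vrow, crow in zip(values, columns)]
```

Because every row performs an identical number of fetches, the computation is uniform across rows, at the cost of wasted storage when row lengths differ widely.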

A general evaluation of the suitability of GPUs for linear 
algebra operations was done by Fatahalian et al. [FSH04]. 
They focused on matrix-matrix multiplication and discov- 
ered that these operations are strongly limited by mem- 
ory bandwidth when implemented on the GPU. They ex- 
plained the reasons for this behavior and proposed architec- 
tural changes to further improve GPU linear algebra perfor- 
mance. To better adapt to such future hardware changes and 
to address vendor-specific hardware differences, Jiang and 
Snir presented a first evaluation of automatically tuning GPU 
linear algebra code [JS05].

4.5. Data Queries

In this section, we provide a brief overview of the ba- 
sic database queries that can be performed efficiently on a 
GPU [GLW*04]. 

Given a relational table $T$ of $m$ attributes $(a_1, a_2, \ldots, a_m)$, a
basic SQL query takes the form 

Select A 
from T 
where C 

where A is a list of attributes or aggregations defined on
individual attributes and C is a Boolean combination of
predicates that have the form $a_i$ op $a_j$ or $a_i$ op constant. The
operator op may be any of the following: $=, \neq, >, \geq, <, \leq$.
Broadly, SQL queries involve four categories of basic
operations: predicates, Boolean combinations, aggregations, and
join operations. They are implemented efficiently using
graphics processors as follows:

Predicates We can use the depth test and the stencil test
functionality for evaluating predicates in the form of $a_i$ op
constant. Predicates involving comparisons between two
attributes, $a_i$ op $a_j$, are transformed to $(a_i - a_j)$ op 0
using the programmable pipeline and are evaluated using the
depth and stencil tests.

Boolean combinations A Boolean combination of predi- 
cates is expressed in a conjunctive normal form. The sten- 
cil test can be used repeatedly to evaluate a series of log- 
ical operators with the intermediate results stored in the 
stencil buffer. 

Aggregations These include simple operations such as 
COUNT, AVG, and MAX, and can be implemented us- 
ing the counting capability of the occlusion queries. 

Join Operations Join operations combine the records in 
multiple relations using a common join key attribute. 
They are computationally expensive, and can be accel- 
erated by sorting the records based on the join key. The 
fast sorting algorithms described in Section 4.1.5 are 
used to efficiently order the records based on the join 
key [GM05]. 

The attributes of each database record are stored in the 
multiple channels of a single texel, or in the same texel lo- 
cation of multiple textures, and are accessed at run-time to 
evaluate the queries. 
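
The logical structure of these operations can be modeled on the CPU as follows. The sketch uses Boolean masks where a GPU implementation would use the depth buffer, stencil buffer, and occlusion query counts. All function names (`predicate`, `attr_predicate`, `cnf`, `count`) are hypothetical and chosen only for illustration.

```python
import operator


def predicate(column, op, constant):
    """a_i op constant: one pass/fail flag per record (cf. the depth test)."""
    return [op(v, constant) for v in column]


def attr_predicate(col_i, op, col_j):
    """a_i op a_j, rewritten as (a_i - a_j) op 0."""
    return [op(a - b, 0) for a, b in zip(col_i, col_j)]


def cnf(clauses):
    """Evaluate a conjunction of clauses, each a disjunction of masks.

    The running result plays the role of the stencil buffer.
    """
    n = len(clauses[0][0])
    result = [True] * n
    for clause in clauses:
        clause_mask = [any(mask[k] for mask in clause) for k in range(n)]
        result = [r and c for r, c in zip(result, clause_mask)]
    return result


def count(mask):
    """COUNT aggregation: number of passing records (cf. occlusion queries)."""
    return sum(mask)
```

For example, `count(cnf([[predicate(age, operator.gt, 20)]]))` would correspond to `SELECT COUNT(*) FROM T WHERE age > 20` for a hypothetical column `age`.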

5. GPGPU Applications 

Using many of the algorithms and techniques described in 
the previous section, in this section we survey the broad 
range of applications and tasks implemented on graphics 
hardware. 



5.1. Early Work

The use of computer graphics hardware for general-purpose
computation has been an area of active research for many
years, beginning on machines like the Ikonas [Eng78], the
Pixel Machine [PH89], and Pixel-Planes 5 [FPE*89]. Pixar's
Chap [LP84] was one of the earliest processors to explore
a programmable SIMD computational organization, on 16-bit
integer data; Flap [LHPL87], described three years later,
extended Chap's integer capabilities with SIMD floating-point
pipelines. These early graphics computers were typically
graphics compute servers rather than desktop workstations.
Early work on procedural texturing and shading was
performed on the UNC Pixel-Planes 5 and PixelFlow machines
[RTB*92,OL98]. This work can be seen as precursor
to the high-level shading languages in common use today
for both graphics and GPGPU applications. The PixelFlow
SIMD graphics computer [EMP*97] was also used to crack
UNIX password encryption [KI99].

The wide deployment of GPUs in the last several years
has resulted in an increase in experimental research with
graphics hardware. The earliest work on desktop graphics
processors used non-programmable ("fixed-function")
GPUs. Lengyel et al. used rasterization hardware for robot
motion planning [LRDG90]. Hoff et al. described the use
of z-buffer techniques for the computation of Voronoi
diagrams [HCK*99] and extended their method to proximity
detection [HZLM01]. Bohn et al. used fixed-function
graphics hardware in the computation of artificial neural
networks [Boh98]. Convolution and wavelet transforms
with the fixed-function pipeline were realized by Hopf and
Ertl [HE99a, HE99b].

Programmability in GPUs first appeared in the form of
vertex programs combined with a limited form of fragment
programmability via extensive user-configurable texture
addressing and blending operations. While these don't
constitute a true ISA, so to speak, they were abstracted in a
very simple shading language in Microsoft's pixel shader
version 1.0 in Direct3D 8.0. Trendall and Stewart gave a
detailed summary of the types of computation available on
these GPUs [TS00]. Thompson et al. used the programmable
vertex processor of an NVIDIA GeForce 3 GPU to solve
the 3-Satisfiability problem and to perform matrix
multiplication [THO02]. A major limitation of this generation of
GPUs was the lack of floating-point precision in the fragment
processors. Strzodka showed how to combine multiple
8-bit texture channels to create virtual 16-bit precise
operations [Str02], and Harris analyzed the accumulated error
in boiling simulation operations caused by the low precision
[Har02]. Strzodka constructed and analyzed special
discrete schemes which, for certain PDE types, allow
reproduction of the qualitative behavior of the continuous
solution even with very low computational precision, e.g. 8
bits [Str04].



5.2. Physically-Based Simulation 

Early GPU-based physics simulations used cellular techniques
such as cellular automata (CA). Greg James of
NVIDIA demonstrated the "Game of Life" cellular automaton
and a 2D physically based wave simulation running on
NVIDIA GeForce 3 GPUs [Jam01a, Jam01b, Jam01c]. Harris
et al. used a Coupled Map Lattice (CML) to simulate
dynamic phenomena that can be described by partial differ- 
ential equations, such as boiling, convection, and chemical 
reaction-diffusion [HCSL02]. The reaction-diffusion portion 
of this work was later extended to a finite difference imple- 
mentation of the Gray-Scott equations using floating-point- 
capable GPUs [HJ03]. Kim and Lin used GPUs to simu- 
late dendritic ice crystal growth [KL03]. Related to cellular 
techniques are lattice simulation approaches such as Lattice- 
Boltzmann Methods (LBM), used for fluid and gas simula- 
tion. LBM represents fluid velocity in "packets" traveling 
in discrete directions between lattice cells. Li et al. have 
used GPUs to apply LBM to a variety of fluid flow prob- 
lems [LWK03,LFWK05]. 
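
As a small illustration of the cellular techniques mentioned above, the sketch below computes one generation of Conway's "Game of Life". It is a CPU model rather than the GeForce 3 implementation, and the wrap-around (toroidal) boundary is an assumption made here for brevity. On a GPU, each cell update would be a fragment that samples its eight neighbors from a texture.

```python
def life_step(grid):
    """Return the next Game of Life generation of a 0/1 grid (toroidal wrap)."""
    rows, cols = len(grid), len(grid[0])
    new = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            # Count the eight neighbors, wrapping around the grid edges.
            n = sum(grid[(i + di) % rows][(j + dj) % cols]
                    for di in (-1, 0, 1) for dj in (-1, 0, 1)
                    if (di, dj) != (0, 0))
            # Birth on exactly 3 neighbors; survival on 2 or 3.
            new[i][j] = 1 if n == 3 or (grid[i][j] and n == 2) else 0
    return new
```

Every cell reads only the previous generation, so the whole grid can be updated in a single data-parallel pass.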

Full floating point support in GPUs has enabled the next 
step in physically-based simulation: finite difference and fi- 
nite element techniques for the solution of systems of par- 
tial differential equations (PDEs). Spring-mass dynamics on 
a mesh were used to implement basic cloth simulation on 
a GPU [Gre03, Zel05]. Several researchers have also im- 
plemented particle system simulation on GPUs (see Sec- 
tion 4.3). 

Several groups have used the GPU to successfully simu- 
late fluid dynamics. Four papers in the summer of 2003 pre- 
sented solutions of the Navier-Stokes equations (NSE) for 
incompressible fluid flow on the GPU [BFGS03,GWL*03, 
HBSL03, KW03]. Harris provides an introduction to the 
NSE and a detailed description of a basic GPU imple- 
mentation [Har04]. Harris et al. combined GPU-based NSE 
solutions with PDEs for thermodynamics and water con- 
densation and light scattering simulation to implement vi- 
sual simulation of cloud dynamics [HBSL03]. Other re- 
cent work includes flow calculations around arbitrary obsta- 
cles [BFGS03,KW03,LLW04]. Sander et al. [STM04] de- 
scribed the use of GPU depth-culling hardware to acceler- 
ate flow around obstacles, and sample code that implements 
this technique is made available by Harris [Har05b]. Rumpf 
and Strzodka used a quantized FEM approach to solving 
the anisotropic heat equation on a GPU [RS01c] (see
Section 4.3).

Related to fluid simulation is the visualization of flows, 
which has been implemented using graphics hardware to ac- 
celerate line integral convolution and Lagrangian-Eulerian 
advection [HWSE99, JEH01, WHE01].

5.3. Signal and Image Processing 

The high computational rates of the GPU have made graph- 
ics hardware an attractive target for demanding applications 
such as those in signal and image processing. Among the 
most prominent applications in this area are those related
to image segmentation (Section 5.3.1) as well as a variety
of other applications across the gamut of signal, image, and 
video processing (Section 5.3.2). 

5.3.1. Segmentation 

The segmentation problem seeks to identify features embed- 
ded in 2D or 3D images. A driving application for segmen- 
tation is medical imaging. A common problem in medical 
imaging is to identify a 3D surface embedded in a volume 
image obtained with an imaging technique such as Magnetic 
Resonance Imaging (MRI) or Computed Tomography (CT)
imaging. Fully automatic segmentation is an unsolved image
processing research problem. Semi-automatic methods,
however, offer great promise by allowing users to interac- 
tively guide image processing segmentation computations. 
GPGPU segmentation approaches have made a significant 
contribution in this area by providing speedups of more than 
10× and coupling the fast computation to an interactive
volume renderer.

Image thresholding is a simple form of segmentation that 
determines if each pixel in an image is within the segmented 
region based on the pixel value. Yang et al. [YW03] used 
register combiners to perform thresholding and basic con- 
volutions on 2D color images. Their NVIDIA GeForce4 
GPU implementation demonstrated a 30% speed increase 
over a 2.2 GHz Intel Pentium 4 CPU. Viola et al. performed 
threshold-based 3D segmentations combined with an inter- 
active visualization system and observed an approximately 
8× speedup over a CPU implementation [VKG03].

Implicit surface deformation is a more powerful and accu- 
rate segmentation technique than thresholding but requires 
significantly more computation. These level-set techniques 
specify a partial differential equation (PDE) that evolves an 
initial seed surface toward the final segmented surface. The 
resulting surface is guaranteed to be a continuous, closed 
surface. 

Rumpf et al. were the first to implement level-set segmentation
on GPUs [RS01a]. They supported 2D image segmentation
using a 2D level-set equation with intensity and gradient
image-based forces. Lefohn et al. extended that work
and demonstrated the first 3D level-set segmentation on the 
GPU [LW02]. Their implementation also supported a more 
complex evolution function that allowed users to control 
the curvature of the evolving segmentation, thus enabling 
smoothing of noisy data. These early implementations com- 
puted the PDE on the entire image despite the fact that only 
pixels near the segmented surface require computation. As 
such, these implementations were not faster than highly op- 
timized sparse CPU implementations. 

The first GPU-based sparse segmentation solvers came a 
year later. Lefohn et al. [LKHW03,LKHW04] demonstrated 
a sparse (narrow-band) 3D level-set solver that provided a 
speedup of 10-15× over a highly optimized CPU-based
solver [Ins03] (Figure 13). They used a page table data structure
to store and compute only a sparse subset of the volume
on the GPU. Their scheme used the CPU as a GPU memory 
manager, and the GPU requested memory allocation changes 
by sending a bit vector message to the CPU. Concurrently, 
Sherbondy et al. presented a GPU-based 3D segmentation 
solver based on the Perona-Malik PDE [SHN03]. They also 
performed sparse computation, but had a dense (complete) 
memory representation. They used the depth culling tech- 
nique for conditional execution to perform sparse computa- 
tion. 

Both of the segmentation systems presented by Lefohn 
et al. and Sherbondy et al. were integrated with interactive 
volume renderers. As such, users could interactively control
the evolving PDE computation. Lefohn et al. used their sys- 
tem in a tumor segmentation user study [LCW03]. The study 
found that relatively untrained users could segment tumors 
from a publicly available MRI data set in approximately 
six minutes. The resulting segmentations were more precise
than, and equivalently accurate to, those produced by trained
experts who took multiple hours to segment the tumors manually.
Cates et al. extended this work to multi-channel (color)
data and provided additional information about the statistical
methods used to evaluate the study [CLW04].




Figure 13: Interactive volume segmentation and visualiza- 
tion of Magnetic Resonance Imaging (MRI) data on the 
GPU enables fast and accurate medical segmentations. Im- 
age generated by Lefohn et al. [LKHW04]. 



5.3.2. Other Signal and Image Processing Applications 

Computer Vision Fung et al. use graphics hardware 
to accelerate image projection and compositing oper- 
ations [FTM02] in a camera-based head-tracking sys- 
tem [FM04]; their implementation has been released as the 
open-source OpenVIDIA computer vision library [Ope05], 
whose website also features a good bibliography of papers 
for GPU-based computer/machine vision applications. 



Yang and Pollefeys used GPUs for real-time stereo depth 
extraction from multiple images [YP05]. Their pipeline first 
rectifies the images using per-pixel projective texture mapping,
then computes disparity values between the two images,
and, using adaptive aggregation windows and cross
checking, chooses the most accurate disparity value. Their 
implementation was more than four times faster than a com- 
parable CPU-based commercial system. Both Geys et al. and 
Woetzel and Koch addressed a similar problem using a plane 
sweep algorithm. Geys et al. compute depth from pairs of
images using a fast plane sweep to generate a crude depth 
map, then use a min-cut/max-flow algorithm to refine the 
result [GKV04]; the approach of Woetzel and Koch begins 
with a plane sweep over images from multiple cameras and 
pays particular attention to depth discontinuities [WK04]. 

Image Processing The process of image registration estab- 
lishes a correlation between two images by means of a (pos- 
sibly non-rigid) deformation. The work of Strzodka et al. is 
one of the earliest to use the programmable floating point ca- 
pabilities of graphics hardware in this area [SDR03,SDR04]; 
their image registration implementation is based on the 
multi-scale gradient flow registration method of Clarenz et 
al. [CDR02] and uses an efficient multi-grid representation 
of the image multi-scales, a fast multi-grid regularization, 
and an adaptive time-step control of the iterative solvers. 
They achieve per-frame computation time of under 2 sec- 
onds on pairs of 256x256 images. 

Strzodka and Garbe describe a real-time system that computes
and visualizes motion on 640 × 480, 25 Hz 2D image
sequences using graphics hardware [SG04]. Their system
assumes that image brightness changes only because of motion
(the brightness change constraint equation). Using
this assumption, they estimate the motion vectors by calculating
the eigenvalues and eigenvectors of the matrix constructed
from the averaged partial space and time derivatives
of image brightness. Their system is 4.5 times faster than a
CPU equivalent (as of May 2004), and they expect the addi- 
tional arithmetic capability of newer graphics hardware will 
allow the use of more advanced estimation models (such as 
estimation of brightness changes) in real time. 

Computed tomography (CT) methods that reconstruct 
an object from its projections are computationally inten- 
sive and often accelerated by special-purpose hardware. 
Xu and Mueller implement three 3D reconstruction algo- 
rithms (Feldkamp Filtered Backprojection, SART, and EM) 
on programmable graphics hardware, achieving high-quality 
floating-point $128^3$ reconstructions from 80 projections in
timeframes from seconds to tens of seconds [XM05]. 

Signal Processing Motivated by the high arithmetic
capabilities of modern GPUs, several projects have
developed GPU implementations of the fast Fourier transform
(FFT) [BFH*04,JvHK04,MA03,SL05]. (The GPU Gems
2 chapter by Sumanaweera and Liu, in particular, gives a
detailed description of the FFT and their GPU
implementation [SL05].) In general, these implementations
operate on 1D or 2D input data, use a radix-2
decimation-in-time approach, and require one
fragment-program pass per FFT stage. The real and imaginary
components of the FFT can be computed in two components of
the 4-vectors in each fragment processor, so two FFTs can
easily be processed in parallel. These implementations are
primarily limited by memory bandwidth and the lack of
effective caching in today's GPUs, and only by processing
two FFTs simultaneously can they match the performance of a
highly tuned CPU implementation [FJ98].
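
The pass structure described above can be illustrated on the CPU. The following Python sketch is not taken from any of the cited implementations; it assumes a power-of-two input length, and the function names (bit_reverse, fft_stage, fft_radix2) are hypothetical. Each call to fft_stage plays the role of one fragment-program pass: every output element gathers two values from the buffer written by the previous pass, so a length-N transform needs log2(N) passes.

```python
import cmath


def bit_reverse(index, bits):
    """Reverse the lowest `bits` bits of `index` (input reordering for decimation in time)."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)
        index >>= 1
    return result


def fft_stage(src, stage):
    """One radix-2 pass: each output element reads two inputs and writes one value."""
    n = len(src)
    half = 1 << stage            # distance between the two butterfly inputs
    span = half << 1             # size of one butterfly group in this stage
    dst = [0j] * n               # separate output buffer, like a second render target
    for k in range(n):           # on a GPU, one fragment per k
        j = k % span             # position inside the butterfly group
        base = k - j
        if j < half:
            twiddle = cmath.exp(-2j * cmath.pi * j / span)
            dst[k] = src[base + j] + twiddle * src[base + j + half]
        else:
            twiddle = cmath.exp(-2j * cmath.pi * (j - half) / span)
            dst[k] = src[base + j - half] - twiddle * src[base + j]
    return dst


def fft_radix2(values):
    """Radix-2 decimation-in-time FFT built from log2(n) gather passes."""
    n = len(values)
    bits = n.bit_length() - 1
    if n == 0 or (1 << bits) != n:
        raise ValueError("input length must be a power of two")
    buf = [complex(values[bit_reverse(i, bits)]) for i in range(n)]
    for stage in range(bits):
        buf = fft_stage(buf, stage)   # ping-pong: the output becomes the next input
    return buf
```

In a GPU version the two buffers would be textures, and the two unused components of each 4-vector could carry a second, independent transform, which is the packing the paragraph above refers to.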

Daniel Horn maintains an open-source optimized FFT
library based on the Brook distribution [Hor05a]. The discrete
wavelet transform (DWT), used in the JPEG2000 standard,
is another useful fundamental signal processing operation; a
group from the Chinese University of Hong Kong has
developed a GPU implementation of the DWT [WWHL05], which
has been integrated into an open-source JPEG2000 codec
called "JasPer" [Ada05].



Tone Mapping Tone mapping is the process of mapping 
pixel intensity values with high dynamic range to the smaller 
range permitted by a display. Goodnight et al. implemented 
an interactive, time-dependent tone mapping system on 
GPUs [GWWH03]. In their implementation, they chose the 
tone-mapping algorithm of Reinhard et al. [RSSF02], which 
is based on the "zone system" of photography, for two rea- 
sons. First, the transfer function that performs the tone map- 
ping uses a minimum of global information about the im- 
age, making it well-suited to implementation on graphics 
hardware. Second, Reinhard et al.'s algorithm can be adap- 
tively refined, allowing a GPU implementation to trade off 
efficiency and accuracy. Among the tasks in Goodnight et 
al.'s pipeline was an optimized implementation of a Gaus- 
sian convolution. On an ATI Radeon 9800, they were able 
to achieve highly interactive frame rates with few adaptation 
zones (limited by mipmap construction) and a few frames 
per second with many adaptation zones (limited by the per- 
formance of the Gaussian convolution). 
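
As a rough illustration of why the transfer function needs little global information, the sketch below applies the simple global form of Reinhard et al.'s operator to a list of luminance values. The only image-wide quantity is the log-average luminance, which a GPU could obtain by a reduction. The sketch omits the adaptive, zone-based refinement and the Gaussian convolutions used in Goodnight et al.'s system; the function names, the small offset delta, and the default key value of 0.18 are assumptions made for illustration.

```python
import math


def log_average_luminance(luminances, delta=1e-4):
    """Geometric-mean style average; delta avoids log(0) for black pixels."""
    total = sum(math.log(delta + value) for value in luminances)
    return math.exp(total / len(luminances))


def tone_map(luminances, key=0.18):
    """Map high-dynamic-range luminances into [0, 1) with L / (1 + L)."""
    average = log_average_luminance(luminances)
    mapped = []
    for value in luminances:              # independent per pixel, one fragment each
        scaled = key * value / average    # expose the image relative to its average
        mapped.append(scaled / (1.0 + scaled))
    return mapped
```

Because every pixel depends only on its own value and one scalar, the per-pixel step maps directly to a single fragment-program pass.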



Audio Jędrzejewski used ray tracing techniques on GPUs
to compute echoes of sound sources in highly occluded
environments [Jęd04]. BionicFX has developed commercial
"Audio Video Exchange" (AVEX) software that accelerates
audio effect calculations using GPUs [Bio05].



Image/Video Processing Frameworks Apple's Core Im- 
age and Core Video frameworks allow GPU acceleration 
of image and video processing tasks [App05a]; the open- 
source framework Jahshaka uses GPUs to accelerate video 
compositing [Jah05]. 

© The Eurographics Association 2005. 



5.4. Global Illumination 

Perhaps not surprisingly, one of the early areas of GPGPU 
research was aimed at improving the visual quality of
GPU-generated images. Many of the techniques described below
accomplish this by simulating an entirely different image 
generation process from within a fragment program (e.g. a 
ray tracer). These techniques use the GPU strictly as a com- 
puting engine. Other techniques leverage the GPU to per- 
form most of the rendering work, and augment the result- 
ing image with global effects. Figure 14 shows images from 
some of the techniques we discuss in this section. 

5.4.1. Ray Tracing 

Ray tracing is a rendering technique based on simulating 
light interactions with surfaces [Whi80]. It is nearly the re- 
verse of the traditional GPU rendering algorithm: the color 
of each pixel in an image is computed by tracing rays out 
from the scene camera and discovering which surfaces are 
intersected by those rays and how light interacts with those 
surfaces. The ray-surface intersection serves as a core for 
many global illumination algorithms. Perhaps it is not sur- 
prising, then, that ray tracing was one of the earliest GPGPU 
global illumination techniques to be implemented. 

Ray tracing consists of several types of computation: ray 
generation, ray-surface intersection, and ray-surface shad- 
ing. Generally, there are too many surfaces in a scene to 
brute-force-test every ray against every surface for intersec- 
tion, so there are several data structures that reduce the total 
number of surfaces rays need to test against (called accelera- 
tion structures). Ray-surface shading generally requires gen- 
erating additional rays to test against the scene (e.g. shadow 
rays, reflection rays, etc.). The earliest GPGPU ray tracing
systems demonstrated not only that the GPU was capable of
performing ray-triangle intersections [CHH02], but also that
the entire ray tracing computation, including acceleration
structure traversal and shading, could be implemented
entirely within a set of fragment programs [PBMH02, Pur04].
Section 4.2 enumerates several of the data structures used
in this ray tracer.
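
The ray-surface intersection step at the core of these systems can be sketched as follows. The code uses the widely known Möller-Trumbore ray-triangle test, which is not necessarily the formulation used in [CHH02] or [PBMH02], and the helper names are hypothetical. The closest_hit function is the brute-force loop over all triangles that acceleration structures are designed to avoid.

```python
def sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def intersect(origin, direction, triangle, eps=1e-9):
    """Return the ray parameter t of the hit, or None if the ray misses."""
    v0, v1, v2 = triangle
    edge1 = sub(v1, v0)
    edge2 = sub(v2, v0)
    p = cross(direction, edge2)
    det = dot(edge1, p)
    if abs(det) < eps:                 # ray is parallel to the triangle plane
        return None
    inv_det = 1.0 / det
    to_origin = sub(origin, v0)
    u = dot(to_origin, p) * inv_det    # first barycentric coordinate
    if u < 0.0 or u > 1.0:
        return None
    q = cross(to_origin, edge1)
    v = dot(direction, q) * inv_det    # second barycentric coordinate
    if v < 0.0 or u + v > 1.0:
        return None
    t = dot(edge2, q) * inv_det
    return t if t > eps else None      # ignore hits behind the ray origin


def closest_hit(origin, direction, triangles):
    """Brute-force test against every triangle; returns (t, index) or None."""
    best = None
    for index, triangle in enumerate(triangles):
        t = intersect(origin, direction, triangle)
        if t is not None and (best is None or t < best[0]):
            best = (t, index)
    return best
```

Each ray is independent of every other ray, which is what makes it natural to assign one ray to each fragment.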

Some of the early ray tracing work required special 
drivers, as features like fragment programs and floating point 
buffers were relatively new and rapidly evolving. There are 
currently open source GPU-based ray tracers that run with 
standard drivers and APIs [Chr05,KL04].

Weiskopf et al. have implemented nonlinear ray tracing on 
the GPU [WSE04]. Nonlinear ray tracing is a technique that 
can be used for visualizing gravitational phenomena such as 
black holes, or light propagation through media with a vary- 
ing index of refraction (which can produce mirages). Their 
technique builds upon the linear ray tracing discussed
previously, and approximates curved rays with multiple ray
segments.

Figure 14: Sample images from several global illumination techniques implemented on the GPU. (a) Ray tracing and photon
mapping [PDC*03]. (b) Radiosity [CHL04]. (c) Subsurface scattering [CHH03]. (d) Final gather by rasterization [Hac05].



5.4.2. Photon Mapping 

Photon mapping [Jen96] is a two-stage global illumination 
algorithm. The first stage consists of emitting photons out 
from the light sources in the scene, simulating the photon 
interactions with surfaces, and finally storing the photons 
in a data structure for lookup during the second stage. The 
second stage in the photon mapping algorithm is a render- 
ing stage. Initial surface visibility and direct illumination are 
computed first, often by ray tracing. Then, the light at each 
surface point that was contributed by the environment (indi- 
rect) or through focusing by reflection or refraction (caustic) 
is computed. These computations are done by querying the 
photon map to get estimates for the amount of energy that 
arrived from these sources. 

Tracing photons is much like the ray tracing discussed
previously. Constructing the photon map and indexing the map
to find good energy estimates at each image point are much
more difficult on the GPU. Ma and McCool proposed a
low-latency photon lookup algorithm based on hash ta- 
bles [MM02]. Their algorithm was never implemented on 
the GPU, and construction of the hash table is not currently 
amenable to a GPU implementation. Purcell et al. imple- 
mented two different techniques for constructing the pho- 
ton map and a technique for querying the photon map, all of 
which run at interactive rates [PDC*03] (see Sections 4.1.6 
and 4.2 for some implementation details). Figure 14a shows 
an image rendered with this system. Finally, Larsen and 
Christensen load-balance photon mapping between the GPU 
and the CPU and exploit inter-frame coherence to achieve 
very high frame rates for photon mapping [LC04]. 
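
The query in the second stage is commonly a k-nearest-neighbor density estimate: gather the k photons closest to the shading point and divide their summed power by the area of the disc that contains them. The sketch below shows that estimate with a brute-force search; it is an illustration under assumed names and data layout (a list of position and power pairs), not the grid-based or hash-based structures of [PDC*03] or [MM02].

```python
import math


def squared_distance(a, b):
    return sum((a[i] - b[i]) ** 2 for i in range(3))


def density_estimate(photons, point, k):
    """Estimate arriving energy per unit area at `point` from the k nearest photons.

    photons: list of (position, power) pairs, position being an (x, y, z) tuple.
    """
    nearest = sorted(photons, key=lambda photon: squared_distance(photon[0], point))[:k]
    if not nearest:
        return 0.0
    radius_sq = squared_distance(nearest[-1][0], point)   # farthest of the k photons
    if radius_sq == 0.0:
        return 0.0                                        # degenerate: no area to divide by
    total_power = sum(power for _, power in nearest)
    return total_power / (math.pi * radius_sq)
```

The sort over all photons is exactly the step that is hard to express on the GPU, which is why the cited systems build spatial data structures for the lookup instead.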

5.4.3. Radiosity 

At a high level, radiosity works much like photon mapping 
when computing global illumination for diffuse surfaces. In 
a radiosity-based algorithm, energy is transferred around the 
scene much like photons are. Unlike photon mapping, the 
energy is not stored in a separate data structure that can be 
queried at a later time. Instead, the geometry in the scene is 
subdivided into patches or elements, and each patch stores 
the energy arriving on that patch. 



The classical radiosity algorithm [GTGB84] solves 
for all energy transfer simultaneously. Classical radiosity 
was implemented on the GPU with an iterative Jacobi 
solver [CHH03]. The implementation was limited to matri- 
ces of around 2000 elements, severely limiting the complex- 
ity of the scenes that can be rendered. 
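
For reference, the classical formulation solves B_i = E_i + rho_i * sum_j F_ij B_j for all patches at once. The sketch below applies Jacobi iteration to that system on the CPU; the function and parameter names and the dense form-factor matrix are assumptions for illustration, not the data layout of [CHH03]. The dense n-by-n matrix is what limits the approach to small element counts.

```python
def jacobi_radiosity(emission, reflectance, form_factors, iterations=50):
    """Solve B = E + rho * F * B by Jacobi iteration.

    form_factors[i][j] is the form factor F_ij from patch i to patch j.
    """
    n = len(emission)
    radiosity = list(emission)
    for _ in range(iterations):
        updated = []
        for i in range(n):                       # each patch updates independently
            gathered = sum(form_factors[i][j] * radiosity[j]
                           for j in range(n) if j != i)
            diagonal = 1.0 - reflectance[i] * form_factors[i][i]
            updated.append((emission[i] + reflectance[i] * gathered) / diagonal)
        radiosity = updated                      # all updates read the previous iterate
    return radiosity
```

Every entry of the new iterate depends only on the previous iterate, so one iteration corresponds to one data-parallel pass.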

An alternate method for solving radiosity equations, 
known as progressive radiosity, iterates through the energy 
transfer until the system reaches a steady state [CCWG88]. 
A GPU implementation of progressive radiosity can render 
scenes with over one million elements [CHL04,CH05]. Fig- 
ure 14b shows a sample image created with progressive re- 
finement radiosity on the GPU. 
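
A minimal CPU sketch of the progressive (shooting) formulation follows. At each step the patch with the most unshot energy distributes it to all other patches; no full matrix solve is needed, and only one row of form factors is used per step. The names, the explicit patch areas, and the stopping tolerance are assumptions for illustration and do not reflect the implementation in [CHL04, CH05].

```python
def progressive_radiosity(emission, reflectance, form_factors, areas,
                          tolerance=1e-9, max_steps=10000):
    """Shoot unshot radiosity from the brightest patch until little energy remains.

    form_factors[i][j] is the form factor F_ij from patch i to patch j.
    """
    n = len(emission)
    radiosity = list(emission)
    unshot = list(emission)
    for _ in range(max_steps):
        shooter = max(range(n), key=lambda i: unshot[i] * areas[i])
        if unshot[shooter] * areas[shooter] <= tolerance:
            break                                 # remaining energy is negligible
        delta = unshot[shooter]
        unshot[shooter] = 0.0
        for j in range(n):
            if j == shooter:
                continue
            # Reciprocity: F_ji = F_ij * A_i / A_j
            received = (reflectance[j] * delta *
                        form_factors[shooter][j] * areas[shooter] / areas[j])
            radiosity[j] += received
            unshot[j] += received
    return radiosity
```

On the two-patch configuration used to check the sketch, this converges to the same solution as the simultaneous solve, while each step touches only one row of form factors.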

5.4.4. Subsurface Scattering 

Most real-world surfaces do not completely absorb, reflect, 
or refract incoming light. Instead, incoming light usually 
penetrates the surface and exits the surface at another lo- 
cation. This subsurface scattering effect is an important 
component in modeling the appearance of transparent sur- 
faces [HK93]. This subtle yet important effect has also been 
implemented on the GPU [CHH03]. Figure 14c shows an 
example of GPU subsurface scattering. The GPU implemen- 
tation of subsurface scattering uses a three-pass algorithm. 
First, the amount of light on a given patch in the model is 
computed. Second, a texture map of the transmitted radiosity 
is built using precomputed scattering links. Finally, the gen- 
erated texture is applied to the model. This method for com- 
puting subsurface scattering runs in real time on the GPU. 

5.4.5. Hybrid Rendering 

Finally, several GPGPU global illumination methods that 
have been developed do not fit with any of the classically 
defined rendering techniques. Some methods use traditional 
GPU rendering in unconventional ways to obtain global illu- 
mination effects. Others combine traditional GPU rendering 
techniques with global illumination effects and combine the 
results. We call all of these techniques hybrid global illumi- 
nation techniques. 

The Parthenon renderer generates global illumination
images by rasterizing the scene multiple times, from different
points of view [Hac05]. Each of these scene rasterizations
is accumulated to form an estimate of the indirect
illumination at each visible point. This indirect illumination
estimate is combined with direct illumination computed with a
traditional rendering technique like photon mapping. A sample
image from the Parthenon renderer is shown in Figure 14d.
In a similar fashion, Nijasure computes a sparse sampling of 
the scene for indirect illumination into cubemaps [Nij03]. 
The indirect illumination is progressively computed and 
summed with direct lighting to produce a fully illuminated 
scene. 

Finally, Szirmay-Kalos et al. demonstrate how to approx- 
imate ray tracing on the GPU by localizing environment 
maps [SKALP05]. They use fragment programs to correct 
reflection map lookups to more closely match what a ray 
tracer would compute. Their technique can also be used to 
generate multiple refractions or caustics, and runs in real 
time on the GPU. 

5.5. Geometric Computing 

GPUs have been widely used for performing a number of
geometric computations. These computations are used in many
applications, including motion planning and virtual reality,
and include the following.

Constructive Solid Geometry (CSG) operations CSG
operations are used for geometric modeling in computer
aided design applications. Basic CSG operations involve
Boolean operations such as union, intersection, and
difference, and can be implemented efficiently using the
depth test and the stencil test [GHF86, RR86, GMTF89,
SLJ98, GKMV03].

Distance Fields and Skeletons Distance fields compute
the minimum distance of each point to a set of objects 
and are useful in applications such as path planning and 
navigation. Distance computation can be performed either 
using a fragment program or by rendering the distance 
function of each object in image space [HCK*99,SOM04, 
SPG03,ST04]. 

Collision Detection GPU-based collision detection algo- 
rithms rasterize the objects and perform either 2D or 
2.5-D overlap tests in screen space [BW03, HTG03,
HTG04, HCK*99, KP03, MOK95, RMS92, SF91, VSC01,
GRLM03]. Furthermore, visibility computations can be
performed using occlusion queries and used to compute 
both intra- and inter-object collisions among multiple ob- 
jects [GLM05]. 

Transparency Computation Transparency computations 
require the sorting of 3D primitives or their image-space 
fragments in a back-to-front or a front-to-back order and 
can be performed using depth peeling [Eve01] or by
image-space occlusion queries [GHLM05]. 

Shadow Generation Shadows correspond to the regions
visible to the eye and not visible to the light. Popular
techniques include variations of shadow maps [SKv*92, 
HS99, BAS02, SD02, Sen04, LSK*05] and shadow vol- 
umes [Cro77, Hei91, EK02]. Algorithms have also been 
proposed to generate soft shadows [BS02, ADMAM03, 
CD03]. 
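
Several of the techniques listed above share one image-space pattern: render each object in turn and let a per-pixel test keep the winning value. As a minimal CPU sketch of that pattern, the code below computes a 2D distance field for point sites by "rendering" each site's distance function and keeping the per-pixel minimum, as a depth test would. The function name and the restriction to point sites are assumptions for illustration; the cited systems handle general primitives.

```python
import math


def distance_field(width, height, sites):
    """Per-pixel minimum distance to a set of point sites given as (x, y) pairs."""
    depth = [[math.inf] * width for _ in range(height)]   # cleared depth buffer
    for site_x, site_y in sites:                          # one rendering pass per object
        for y in range(height):
            for x in range(width):                        # one fragment per pixel
                distance = math.hypot(x - site_x, y - site_y)
                if distance < depth[y][x]:                # depth test keeps the minimum
                    depth[y][x] = distance
    return depth
```

The result is only as accurate as the pixel grid, which is the image-precision limitation discussed in the following paragraph.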

These algorithms perform computations in image space, 
and require little or no pre-processing. Therefore, they work 
well on deformable objects. However, the accuracy of these 
algorithms is limited to image precision, and can be an issue 
in some geometric computations such as collision detection. 
Recently, Govindaraju et al. proposed a simple technique to 
overcome the image-precision error by sufficiently "fatten- 
ing" the primitives [GLM04]. The technique has been used 
in performing reliable inter- and intra-object collision com- 
putations among general deformable meshes [GKJ*05]. 

The performance of many geometric algorithms on GPUs 
is also dependent upon the layout of polygonal meshes and a 
better layout effectively utilizes the caches on GPUs such as 
vertex caches. Recently, Yoon et al. proposed a novel method 
for computing cache-oblivious layouts of polygonal meshes 
and applied it to improve the performance of geometric ap- 
plications such as view-dependent rendering and collision 
detection on GPUs [YLPM05]. Their method does not re- 
quire any knowledge of cache parameters and does not make 
assumptions on the data access patterns of applications. A 
user constructs a graph representing an access pattern of an 
application, and the cache-oblivious algorithm constructs a 
mesh layout that works well with the cache parameters. The 
cache-oblivious algorithm was able to achieve a 2-20×
improvement on many complex scenarios without any
modification to the underlying application or the run-time
algorithm.

5.6. Databases and Data Mining 

Database Management Systems (DBMSs) and data mining 
algorithms are an integral part of a wide variety of commer- 
cial applications such as online stock marketing and intru- 
sion detection systems. Many of these applications analyze 
large volumes of online data and are highly computation- 
and memory-intensive. As a result, researchers have been ac- 
tively seeking new techniques and architectures to improve 
the query execution time. The high memory bandwidth and 
the parallel processing capabilities of the GPU can signifi- 
cantly accelerate the performance of many essential database 
queries such as conjunctive selections, aggregations, semi- 
linear queries and join queries. These queries are described 
in Section 4.5. Govindaraju et al. compared the performance 
of SQL queries on an NVIDIA GeForce 6800 against a 
2.8 GHz Intel Xeon processor. Preliminary comparisons in- 
dicate up to an order of magnitude improvement for the GPU 
over a SIMD-optimized CPU implementation [GLW*04]. 
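
To make the mapping concrete, a conjunctive selection can be viewed as one pass per predicate over a column of attribute values, with a per-record mask recording which records have passed every predicate so far. The sketch below mimics that structure on the CPU. It is only loosely modeled on the approach summarized in Section 4.5; the function name, the column dictionary, and the use of Python callables as predicates are assumptions for illustration.

```python
def conjunctive_select(columns, predicates):
    """Evaluate the AND of several single-attribute predicates.

    columns: dict mapping attribute name to a list of values (one per record).
    predicates: list of (attribute name, test function) pairs.
    Returns (mask, count), where mask[i] is True if record i passes all tests.
    """
    num_records = len(next(iter(columns.values())))
    mask = [True] * num_records               # plays the role of a stencil buffer
    for attribute, test in predicates:        # one pass per predicate
        column = columns[attribute]
        for i in range(num_records):          # on a GPU, one fragment per record
            if mask[i] and not test(column[i]):
                mask[i] = False
    count = sum(mask)                         # analogous to counting passed fragments
    return mask, count
```

For example, selecting records with price greater than 10 and quantity greater than 0 takes two passes, after which the count is available without reading the records back.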

GPUs are highly optimized for performing rendering
operations on geometric primitives and can use these
capabilities to accelerate spatial database operations. Sun
et al. exploited the color blending capabilities of GPUs for
spatial selection and join operations on real world datasets
[SAA03]. Bandi et al. integrated GPU-based algorithms for
improving the performance of spatial database operations
into the Oracle 9i DBMS [BSAE04].

Recent research has also focused attention on the effec- 
tive utilization of graphics processors for fast stream min- 
ing algorithms. In these algorithms, data is collected con- 
tinuously and the underlying algorithm performs contin- 
uous queries on the data stream as opposed to one-time 
queries in traditional systems. Many researchers have advo- 
cated the use of GPUs as stream processors for compute- 
intensive algorithms [BFH*04, FF88, Man03, Ven03]. Re- 
cently, Govindaraju et al. have presented fast streaming al- 
gorithms using the blending and texture mapping function- 
alities of GPUs [GRM05]. Data is streamed to and from the 
GPU in real-time, and a speedup of 2-5 times is demon- 
strated on online frequency and quantile estimation queries 
over high-end CPU implementations. The high growth rate
of GPUs, combined with their substantial processing power,
is making the GPU a viable architecture for commercial
database and data mining applications.



6. Conclusions: Looking Forward 

The field of GPGPU computing is approaching something 
like maturity. Early efforts were characterized by a some- 
what ad hoc approach and a "GPGPU for its own sake" at- 
titude; the challenge of achieving non-graphics computation 
on the graphics platform overshadowed analysis of the tech- 
niques developed or careful comparison to well optimized, 
best-in-class CPU analogs. Today researchers in GPGPU 
typically face a much higher bar, set by careful analyses 
such as Fatahalian et al.'s examination of matrix multipli- 
cation [FSH04]. The bar is higher for novelty as well as 
analysis; new work must go beyond simply "porting" an 
existing algorithm to the GPU, to demonstrating general 
principles and techniques or making significantly new and 
non-obvious use of the hardware. Fortunately, the accumu- 
lated body of knowledge on general techniques and building 
blocks surveyed in Section 4 means that GPGPU researchers 
need not continually reinvent the wheel. Meanwhile, devel- 
opers wishing to use GPUs for general-purpose computing 
have a broad array of applications to learn from and build 
on. GPGPU algorithms continue to be developed for a wide 
range of problems, from options pricing to protein folding. 
On the systems side, several research groups have major on- 
going efforts to perform large-scale GPGPU computing by 
harnessing large clusters of GPU-equipped computers. The 
emergence of high-level programming languages provided a
huge leap forward for GPU developers generally, and
languages like BrookGPU [BFH*04] hold similar promise for
non-graphics developers who wish to harness the power of
GPUs.

More broadly, GPUs may be seen as the first genera- 
tion of commodity data-parallel coprocessors. Their tremen- 
dous computational capacity and rapid growth curve, far 
outstripping traditional CPUs, highlight the advantages of 
domain-specialized data-parallel computing. We can expect 
increased programmability and generality from future GPU 
architectures, but not without limit; neither vendors nor users 
want to sacrifice the specialized performance and archi- 
tecture that have made GPUs successful in the first place. 
The next generation of GPU architects face the challenge 
of striking the right balance between improved generality 
and ever-increasing performance. At the same time, other 
data-parallel processors are beginning to appear in the mass 
market, most notably the Cell processor produced by IBM, 
Sony, and Toshiba [PAB*05]. The tiled architecture of the 
Cell provides a dense computational fabric well suited to 
the stream programming model discussed in Section 2.3, 
similar in many ways to GPUs but potentially better suited 
for general-purpose computing. As GPUs grow more gen- 
eral, low-level programming is supplanted by high-level lan- 
guages and toolkits, and new contenders such as the Cell 
chip emerge, GPGPU researchers face the challenge of tran- 
scending their computer graphics roots and developing com- 
putational idioms, techniques, and frameworks for desktop 
data-parallel computing. 

Acknowledgements 

Thanks to Ian Buck, Jeff Bolz, Daniel Horn, Marc Pollefeys, 
and Robert Strzodka for their thoughtful comments, and to 
the anonymous reviewers for their helpful and constructive 
criticism. 

References 

[Ada05] ADAMS M.: JasPer project.
http://www.ece.uvic.ca/~mdadams/jasper/, 2005.

[ADMAM03] ASSARSSON U., DOUGHERTY M., MOUNIER M.,
AKENINE-MÖLLER T.: An optimized soft shadow volume algorithm
with real-time performance. In Graphics Hardware 2003
(July 2003), pp. 33-40.

[App05a] Apple Computer Core Image.
http://www.apple.com/macosx/tiger/coreimage.html, 2005.

[App05b] Apple Computer OpenGL shader builder / profiler.
http://developer.apple.com/graphicsimaging/opengl/, 2005.

[BAS02] BRABEC S., ANNEN T., SEIDEL H.-P.: Shadow mapping
for hemispherical and omnidirectional light sources. In
Advances in Modelling, Animation and Rendering (Proceedings
of Computer Graphics International 2002) (July 2002),
pp. 397-408.

John D. Owens, David Luebke, Naga Govindaraju, Mark Harris, Jens Krüger, Aaron E.
Lefohn, and Timothy J. Purcell. "A Survey of General-Purpose Computation on Graphics
Hardware." In Eurographics 2005, State of the Art Reports, August 2005, pp. 21-51.

[Bat68] BATCHER K. E.: Sorting networks and their
applications. In Proceedings of the AFIPS Spring Joint
Computing Conference (Apr. 1968), vol. 32, pp. 307-314.

[Bax05] BAXTER B.: The image debugger.
http://www.billbaxter.com/projects/imdebug/, 2005.

[Ben75] BENTLEY J. L.: Multidimensional binary search trees
used for associative searching. Communications of the ACM
18, 9 (Sept. 1975), 509-517.

[BFGS03] BOLZ J., FARMER I., GRINSPUN E., SCHRÖDER P.:
Sparse matrix solvers on the GPU: Conjugate gradients and
multigrid. ACM Transactions on Graphics 22, 3 (July 2003),
917-924.

[BFH*04] BUCK I., FOLEY T., HORN D., SUGERMAN J.,
FATAHALIAN K., HOUSTON M., HANRAHAN P.: Brook for GPUs:
Stream computing on graphics hardware. ACM Transactions on
Graphics 23, 3 (Aug. 2004), 777-786.

[BHM00] BRIGGS W. L., HENSON V. E., MCCORMICK S. F.: A
Multigrid Tutorial: Second Edition. Society for Industrial
and Applied Mathematics, Philadelphia, PA, USA, 2000.

[Bio05] BionicFX. http://www.bionicfx.com/, 2005.

[Boh98] BOHN C. A.: Kohonen feature mapping through graphics
hardware. In Proceedings of the Joint Conference on
Information Sciences (1998), vol. II, pp. 64-67.

[BP03] BLEIWEISS A., PREETHAM A.: Ashli — 

Advanced shading language interface. 
ACM SIGGRAPH Course Notes (2003). 
http://www.ati.com/developer/SIGGRAPH03/AshliNotes.pdf.

[BP04] BUCK I., PURCELL T.: A toolkit for computation on
GPUs. In GPU Gems, Fernando R., (Ed.). Addison Wesley,
Mar. 2004, pp. 621-636.

[BS02] BRABEC S., SEIDEL H.-P.: Single sample soft shadows
using depth maps. In Graphics Interface (May 2002),
pp. 219-228.

[BSAE04] BANDI N., SUN C., AGRAWAL D., EL ABBADI A.:
Hardware acceleration in commercial databases: A case study
of spatial operations, pp. 1021-1032.

[Buc04] BUCK I.: GPGPU: General-purpose compu- 

tation on graphics hardware — GPU compu- 
tation strategies & tricks. ACM SIGGRAPH 
Course Notes (Aug. 2004). 

[Buc05] BUCK I.: Taking the plunge into GPU computing. In
GPU Gems 2, Pharr M., (Ed.). Addison Wesley, Mar. 2005,
ch. 32, pp. 509-519.

[BW03] BACIU G., WONG W. S. K.: Image-based techniques in a
hybrid collision detector. IEEE Transactions on
Visualization and Computer Graphics 9, 2 (Apr. 2003),
254-271.

[CCWG88] COHEN M. F., CHEN S. E., WALLACE J. R.,
GREENBERG D. P.: A progressive refinement approach to fast
radiosity image generation. In Computer Graphics
(Proceedings of SIGGRAPH 88) (Aug. 1988), vol. 22,
pp. 75-84.

[CD03] CHAN E., DURAND F.: Rendering fake soft shadows with
smoothies. In Eurographics Symposium on Rendering: 14th
Eurographics Workshop on Rendering (June 2003),
pp. 208-218.

[CDR02] CLARENZ U., DROSKE M., RUMPF M.: Towards fast
non-rigid registration. In Inverse Problems, Image Analysis
and Medical Imaging, AMS Special Session Interaction of
Inverse Problems and Image Analysis (2002), vol. 313, AMS,
pp. 67-84.

[CH05] COOMBE G., HARRIS M.: Global illumination using
progressive refinement radiosity. In GPU Gems 2, Pharr M.,
(Ed.). Addison Wesley, Mar. 2005, ch. 39, pp. 635-647.

[CHH02] CARR N. A., HALL J. D., HART J. C.: The ray engine.
In Graphics Hardware 2002 (Sept. 2002), pp. 37-46.

[CHH03] CARR N. A., HALL J. D., HART J. C.: GPU algorithms
for radiosity and subsurface scattering. In Graphics
Hardware 2003 (July 2003), pp. 51-59.

[CHL04] COOMBE G., HARRIS M. J., LASTRA A.: Radiosity on
graphics hardware. In Proceedings of the 2004 Conference on
Graphics Interface (May 2004), pp. 161-168.

[Chr05] CHRISTEN M.: Ray Tracing on GPU. Master's thesis,
University of Applied Sciences Basel, 2005.






[CLW04] CATES J. E., LEFOHN A. E., WHITAKER R. T.: GIST: An
interactive, GPU-based level-set segmentation tool for 3D
medical images. Medical Image Analysis 10, 4 (July/Aug.
2004), 217-231.

[CND03] CALLELE D., NEUFELD E., DELATHOUWER K.: Sorting on
a GPU.
http://www.cs.usask.ca/faculty/callele/gpusort/gpusort.html,
2003.

[Cro77] CROW F. C.: Shadow algorithms for computer graphics.
In Computer Graphics (Proceedings of SIGGRAPH 77) (July
1977), vol. 11, pp. 242-248.

[DNB*05] DUCA N., NISKI K., BILODEAU J., BOLITHO M.,
CHEN Y., COHEN J.: A relational debugging engine for the
graphics pipeline. ACM Transactions on Graphics 24, 3
(Aug. 2005). To appear.

[DPRS89] DOWD M., PERL Y., RUDOLPH L., SAKS M.: The
periodic balanced sorting network. Journal of the ACM 36, 4
(Oct. 1989), 738-757.

[EK02] EVERITT C., KILGARD M.: Practical and robust
stenciled shadow volumes for hardware-accelerated rendering.
ACM SIGGRAPH Course Notes 31 (2002).

[EMP*97] EYLES J., MOLNAR S., POULTON J., GREER T.,
LASTRA A., ENGLAND N., WESTOVER L.: PixelFlow: The
realization. In 1997 SIGGRAPH / Eurographics Workshop on
Graphics Hardware (Aug. 1997), pp. 57-68.

[Eng78] ENGLAND J. N.: A system for interactive modeling of
physical curved surface objects. In Computer Graphics
(Proceedings of SIGGRAPH 78) (Aug. 1978), vol. 12,
pp. 336-340.

[Eve01] EVERITT C.: Interactive Order-Independent
Transparency. Tech. rep., NVIDIA Corporation, May 2001.
http://developer.nvidia.com/object/Interactive_Order_Transparency.html.

[EVG04] ERNST M., VOGELGSANG C., GREINER G.: Stack
implementation on programmable graphics hardware. In
Proceedings of Vision, Modeling, and Visualization (Nov.
2004), pp. 255-262.

[EWN05] EKMAN M., WARG F., NILSSON J.: An in-depth look at
computer performance growth. ACM SIGARCH Computer
Architecture News 33, 1 (Mar. 2005), 144-147.

[FF88] FOURNIER A., FUSSELL D.: On the power of the frame
buffer. ACM Transactions on Graphics 7, 2 (1988), 103-128.

[FJ98] FRIGO M., JOHNSON S. G.: FFTW: An adaptive software
architecture for the FFT. In Proceedings of the 1998
International Conference on Acoustics, Speech, and Signal
Processing (May 1998), vol. 3, pp. 1381-1384.

[FM04] FUNG J., MANN S.: Computer vision signal processing
on graphics processing units. In Proceedings of the IEEE
International Conference on Acoustics, Speech, and Signal
Processing (May 2004), vol. 5, pp. 93-96.

[FPE*89] FUCHS H., POULTON J., EYLES J., GREER T.,
GOLDFEATHER J., ELLSWORTH D., MOLNAR S., TURK G., TEBBS B.,
ISRAEL L.: Pixel-Planes 5: A heterogeneous multiprocessor
graphics system using processor-enhanced memories. In
Computer Graphics (Proceedings of SIGGRAPH 89) (July 1989),
vol. 23, pp. 79-88.

[FS05] FOLEY T., SUGERMAN J.: KD-Tree acceleration
structures for a GPU raytracer. In Graphics Hardware 2005
(July 2005). To appear.

[FSH04] FATAHALIAN K., SUGERMAN J., HANRAHAN P.:
Understanding the efficiency of GPU algorithms for
matrix-matrix multiplication. In Graphics Hardware 2004
(Aug. 2004), pp. 133-138.

[FTM02] FUNG J., TANG F., MANN S.: Mediated reality using
computer graphics hardware for computer vision. In 6th
International Symposium on Wearable Computing (Oct. 2002),
pp. 83-89.

[GHF86] GOLDFEATHER J., HULTQUIST J. P. M., FUCHS H.: Fast
constructive-solid geometry display in the Pixel-Powers
graphics system. In Computer Graphics (Proceedings of
SIGGRAPH 86) (Aug. 1986), vol. 20, pp. 107-116.

[GHLM05] GOVINDARAJU N. K., HENSON M., LIN M. C.,
MANOCHA D.: Interactive visibility ordering of geometric
primitives in complex environments. In Proceedings of the
2005 Symposium on Interactive 3D Graphics and Games (Apr.
2005), pp. 49-56.

[GKJ*05] GOVINDARAJU N. K., KNOTT D., JAIN N., KABUL I.,
TAMSTORF R., GAYLE R., LIN M. C., MANOCHA D.: Interactive
collision detection between deformable models using
chromatic decomposition. ACM Transactions on Graphics 24, 3
(Aug. 2005). To appear.

[GKMV03] GUHA S., KRISHNAN S., MUNAGALA K.,
VENKATASUBRAMANIAN S.: Application of the two-sided depth
test to CSG rendering. In 2003 ACM Symposium on Interactive
3D Graphics (Apr. 2003), pp. 177-180.

[GKV04] GEYS I., KONINCKX T. P., VAN GOOL L.: Fast
interpolated cameras by combining a GPU based plane sweep
with a max-flow regularisation algorithm. In Proceedings of
the 2nd International Symposium on 3D Data Processing,
Visualization and Transmission (Sept. 2004), pp. 534-541.

[GLM04] GOVINDARAJU N. K., LIN M. C., MANOCHA D.: Fast and
reliable collision culling using graphics hardware. In
Proceedings of ACM Virtual Reality and Software Technology
(Nov. 2004).

[GLM05] GOVINDARAJU N. K., LIN M. C., MANOCHA D.:
Quick-CULLIDE: Efficient inter- and intra-object collision
culling using graphics hardware. In Proceedings of IEEE
Virtual Reality (Mar. 2005), pp. 59-66.

[GLW*04] GOVINDARAJU N. K., LLOYD B., WANG W., LIN M.,
MANOCHA D.: Fast computation of database operations using
graphics processors. In Proceedings of the 2004 ACM SIGMOD
International Conference on Management of Data (June 2004),
pp. 215-226.

[GM05] GOVINDARAJU N. K., MANOCHA D.: Efficient relational
database management using graphics processors. In ACM
SIGMOD Workshop on Data Management on New Hardware (June
2005), pp. 29-34.

[GMTF89] GOLDFEATHER J., MOLNAR S., TURK G., FUCHS H.: Near
real-time CSG rendering using tree normalization and
geometric pruning. IEEE Computer Graphics & Applications 9,
3 (May 1989), 20-28.

[GPU05] GPUSort: A high performance GPU sorting library.
http://gamma.cs.unc.edu/GPUSORT/, 2005.

[Gra05] Graphic Remedy gDEBugger. http://www.gremedy.com/,
2005.

[Gre03] GREEN S.: NVIDIA cloth sample.
http://download.developer.nvidia.com/developer/SDK/Individual_Samples/samples.html#glsl_physics,
2003.

[Gre04] Green S.: NVIDIA particle system sample. http://download.developer.nvidia.com/developer/SDK/Individual_Samples/samples.html#gpu_particles, 2004.

[GRH*05] Govindaraju N. K., Raghuvanshi N., Henson M., Tuft D., Manocha D.: A Cache-Efficient Sorting Algorithm for Database and Data Mining Computations using Graphics Processors. Tech. Rep. TR05-016, University of North Carolina, 2005.

[GRLM03] Govindaraju N. K., Redon S., Lin M. C., Manocha D.: CULLIDE: Interactive collision detection between complex models in large environments using graphics hardware. In Graphics Hardware 2003 (July 2003), pp. 25-32.

[GRM05] Govindaraju N. K., Raghuvanshi N., Manocha D.: Fast and approximate stream mining of quantiles and frequencies using graphics processors. In Proceedings of the 2005 ACM SIGMOD International Conference on Management of Data (2005), pp. 611-622.

[GTGB84] Goral C. M., Torrance K. E., Greenberg D. P., Battaile B.: Modelling the interaction of light between diffuse surfaces. In Computer Graphics (Proceedings of SIGGRAPH 84) (July 1984), vol. 18, pp. 213-222.

[GV96] Golub G. H., Van Loan C. F.: Matrix Computations, Third Edition. The Johns Hopkins University Press, Baltimore, 1996.

[GWL*03] Goodnight N., Woolley C., Lewin G., Luebke D., Humphreys G.: A multigrid solver for boundary value problems using programmable graphics hardware. In Graphics Hardware 2003 (July 2003), pp. 102-111.

[GWWH03] Goodnight N., Wang R., Woolley C., Humphreys G.: Interactive time-dependent tone mapping using programmable graphics hardware. In Eurographics Symposium on Rendering: 14th Eurographics Workshop on Rendering (June 2003), pp. 26-37.

[Hac05] Hachisuka T.: High-quality global illumination rendering using rasterization. In GPU Gems 2, Pharr M., (Ed.). Addison Wesley, Mar. 2005, ch. 38, pp. 615-633.

© The Eurographics Association 2005.

John D. Owens, David Luebke, Naga Govindaraju, Mark Harris, Jens Krüger, Aaron E. Lefohn, and Timothy J. Purcell. "A Survey of General-Purpose Computation on Graphics Hardware." In Eurographics 2005, State of the Art Reports, August 2005, pp. 21-51.

[Har02] Harris M. J.: Analysis of Error in a CML Diffusion Operation. Tech. Rep. TR02-015, University of North Carolina, 2002.

[Har04] Harris M.: Fast fluid dynamics simulation on the GPU. In GPU Gems, Fernando R., (Ed.). Addison Wesley, Mar. 2004, pp. 637-665.

[Har05a] Harris M.: Mapping computational concepts to GPUs. In GPU Gems 2, Pharr M., (Ed.). Addison Wesley, Mar. 2005, ch. 31, pp. 493-508.

[Har05b] Harris M.: NVIDIA fluid code sample. http://download.developer.nvidia.com/developer/SDK/Individual_Samples/samples.html#gpgpu_fluid, 2005.

[HB05] Harris M., Buck I.: GPU flow control idioms. In GPU Gems 2, Pharr M., (Ed.). Addison Wesley, Mar. 2005, ch. 34, pp. 547-555.

[HBSL03] Harris M. J., Baxter III W., Scheuermann T., Lastra A.: Simulation of cloud dynamics on graphics hardware. In Graphics Hardware 2003 (July 2003), pp. 92-101.

[HCK*99] Hoff III K., Culver T., Keyser J., Lin M., Manocha D.: Fast computation of generalized Voronoi diagrams using graphics hardware. In Proceedings of SIGGRAPH 99 (Aug. 1999), Computer Graphics Proceedings, Annual Conference Series, pp. 277-286.

[HCSL02] Harris M. J., Coombe G., Scheuermann T., Lastra A.: Physically-based visual simulation on graphics hardware. In Graphics Hardware 2002 (Sept. 2002), pp. 109-118.

[HE99a] Hopf M., Ertl T.: Accelerating 3D convolution using graphics hardware. In IEEE Visualization '99 (Oct. 1999), pp. 471-474.

[HE99b] Hopf M., Ertl T.: Hardware based wavelet transformations. In Proceedings of Vision, Modeling, and Visualization (1999), pp. 317-328.

[Hei91] Heidmann T.: Real shadows real time. IRIS Universe, 18 (Nov. 1991), 28-31.

[HHN*02] Humphreys G., Houston M., Ng R., Frank R., Ahern S., Kirchner P., Klosowski J.: Chromium: A stream-processing framework for interactive rendering on clusters. ACM Transactions on Graphics 21, 3 (July 2002), 693-702.

[HJ03] Harris M. J., James G.: Simulation and animation using hardware accelerated procedural textures. In Proceedings of Game Developers Conference 2003 (2003).

[HK93] Hanrahan P., Krueger W.: Reflection from layered surfaces due to subsurface scattering. In Proceedings of SIGGRAPH 93 (Aug. 1993), Computer Graphics Proceedings, Annual Conference Series, pp. 165-174.

[HMG03] Hillesland K. E., Molinov S., Grzeszczuk R.: Nonlinear optimization framework for image-based modeling on programmable graphics hardware. ACM Transactions on Graphics 22, 3 (July 2003), 925-934.

[Hor05a] Horn D.: libgpufft. http://sourceforge.net/projects/gpufft/, 2005.

[Hor05b] Horn D.: Stream reduction operations for GPGPU applications. In GPU Gems 2, Pharr M., (Ed.). Addison Wesley, Mar. 2005, ch. 36, pp. 573-589.

[HS86] Hillis W. D., Steele Jr. G. L.: Data parallel algorithms. Communications of the ACM 29, 12 (Dec. 1986), 1170-1183.

[HS99] Heidrich W., Seidel H.-P.: Realistic, hardware-accelerated shading and lighting. In Proceedings of SIGGRAPH 99 (Aug. 1999), Computer Graphics Proceedings, Annual Conference Series, pp. 171-178.

[HTG03] Heidelberger B., Teschner M., Gross M.: Real-time volumetric intersections of deforming objects. In Proceedings of Vision, Modeling and Visualization (Nov. 2003), pp. 461-468.

[HTG04] Heidelberger B., Teschner M., Gross M.: Detection of collisions and self-collisions using image-space techniques. Journal of WSCG 12, 3 (Feb. 2004), 145-152.

[HWSE99] Heidrich W., Westermann R., Seidel H.-P., Ertl T.: Applications of pixel textures in visualization and realistic image synthesis. In 1999 ACM Symposium on Interactive 3D Graphics (Apr. 1999), pp. 127-134.

[HZLM01] Hoff III K. E., Zaferakis A., Lin M. C., Manocha D.: Fast and simple 2D geometric proximity queries using graphics hardware. In 2001 ACM Symposium on Interactive 3D Graphics (Mar. 2001), pp. 145-148.

[Ins03] The Insight Toolkit. http://www.itk.org/, 2003.

[Jah05] Jahshaka: Jahshaka image processing toolkit. http://www.jahshaka.com/, 2005.

[Jam01a] James G.: NVIDIA game of life sample. http://download.developer.nvidia.com/developer/SDK/Individual_Samples/samples.html#GL_GameOfLife, 2001.

[Jam01b] James G.: NVIDIA water surface simulation sample. http://download.developer.nvidia.com/developer/SDK/Individual_Samples/samples.html#WaterInteraction, 2001.

[Jam01c] James G.: Operations for hardware-accelerated procedural texture animation. In Game Programming Gems 2, Deloura M., (Ed.). Charles River Media, 2001, pp. 497-509.

[Jęd04] Jędrzejewski M.: Computation of Room Acoustics on Programmable Video Hardware. Master's thesis, Polish-Japanese Institute of Information Technology, Warsaw, Poland, 2004.

[JEH01] Jobard B., Erlebacher G., Hussaini M. Y.: Lagrangian-Eulerian advection for unsteady flow visualization. In IEEE Visualization 2001 (Oct. 2001), pp. 53-60.

[Jen96] Jensen H. W.: Global illumination using photon maps. In Eurographics Rendering Workshop 1996 (June 1996), pp. 21-30.

[JS05] Jiang C., Snir M.: Automatic tuning matrix multiplication performance on graphics hardware. In Proceedings of the Fourteenth International Conference on Parallel Architecture and Compilation Techniques (PACT) (Sept. 2005). To appear.

[JvHK04] Jansen T., von Rymon-Lipinski B., Hanssen N., Keeve E.: Fourier volume rendering on the GPU using a Split-Stream-FFT. In Proceedings of Vision, Modeling, and Visualization (Nov. 2004), pp. 395-403.

[KBR04] Kessenich J., Baldwin D., Rost R.: The OpenGL Shading Language version 1.10.59. http://www.opengl.org/documentation/oglsl.html, Apr. 2004.

[KI99] Kedem G., Ishihara Y.: Brute force attack on UNIX passwords with SIMD computer. In Proceedings of the 8th USENIX Security Symposium (Aug. 1999), pp. 93-98.

[KKKW05] Krüger J., Kipfer P., Kondratieva P., Westermann R.: A particle system for interactive visualization of 3D flows. IEEE Transactions on Visualization and Computer Graphics (2005). To appear.

[KL03] Kim T., Lin M. C.: Visual simulation of ice crystal growth. In 2003 ACM SIGGRAPH / Eurographics Symposium on Computer Animation (Aug. 2003), pp. 86-97.

[KL04] Karlsson F., Ljungstedt C. J.: Ray tracing fully implemented on programmable graphics hardware. Master's thesis, Chalmers University of Technology, 2004.

[KLRS04] Kolb A., Latta L., Rezk-Salama C.: Hardware-based simulation and collision detection for large particle systems. In Graphics Hardware 2004 (Aug. 2004), pp. 123-132.

[KP03] Knott D., Pai D. K.: CInDeR: Collision and interference detection in real-time using graphics hardware. In Graphics Interface (June 2003), pp. 73-80.

[KSW04] Kipfer P., Segal M., Westermann R.: UberFlow: A GPU-based particle engine. In Graphics Hardware 2004 (Aug. 2004), pp. 115-122.

[KW03] Krüger J., Westermann R.: Linear algebra operators for GPU implementation of numerical algorithms. ACM Transactions on Graphics 22, 3 (July 2003), 908-916.

[KW05] Kipfer P., Westermann R.: Improved GPU sorting. In GPU Gems 2, Pharr M., (Ed.). Addison Wesley, Mar. 2005, ch. 46, pp. 733-746.

[LC04] Larsen B. D., Christensen N. J.: Simulating photon mapping for real-time applications. In Rendering Techniques 2004: 15th Eurographics Workshop on Rendering (June 2004), pp. 123-132.

[LCW03] Lefohn A. E., Cates J. E., Whitaker R. T.: Interactive, GPU-based level sets for 3D brain tumor segmentation. In Medical Image Computing and Computer Assisted Intervention (MICCAI) (2003), pp. 564-572.






[Lef03] Lefohn A. E.: A Streaming Narrow-Band Algorithm: Interactive Computation and Visualization of Level-Set Surfaces. Master's thesis, University of Utah, Dec. 2003.

[LFWK05] Li W., Fan Z., Wei X., Kaufman A.: GPU-based flow simulation with complex boundaries. In GPU Gems 2, Pharr M., (Ed.). Addison Wesley, Mar. 2005, ch. 47, pp. 747-764.

[LHN05] Lefebvre S., Hornus S., Neyret F.: Octree textures on the GPU. In GPU Gems 2, Pharr M., (Ed.). Addison Wesley, Mar. 2005, ch. 37, pp. 595-613.

[LHPL87] Levinthal A., Hanrahan P., Paquette M., Lawson J.: Parallel computers for graphics applications. ACM SIGOPS Operating Systems Review 21, 4 (Oct. 1987), 193-198.

[LKHW03] Lefohn A. E., Kniss J. M., Hansen C. D., Whitaker R. T.: Interactive deformation and visualization of level set surfaces using graphics hardware. In IEEE Visualization 2003 (Oct. 2003), pp. 75-82.

[LKHW04] Lefohn A. E., Kniss J. M., Hansen C. D., Whitaker R. T.: A streaming narrow-band algorithm: Interactive computation and visualization of level-set surfaces. IEEE Transactions on Visualization and Computer Graphics 10, 4 (July/Aug. 2004), 422-433.

[LKM01] Lindholm E., Kilgard M. J., Moreton H.: A user-programmable vertex engine. In Proceedings of ACM SIGGRAPH 2001 (Aug. 2001), Computer Graphics Proceedings, Annual Conference Series, pp. 149-158.

[LKO05] Lefohn A., Kniss J., Owens J.: Implementing efficient parallel data structures on GPUs. In GPU Gems 2, Pharr M., (Ed.). Addison Wesley, Mar. 2005, ch. 33, pp. 521-545.

[LKS*05] Lefohn A. E., Kniss J., Strzodka R., Sengupta S., Owens J. D.: Glift: Generic, efficient, random-access GPU data structures. ACM Transactions on Graphics (2005). To appear.

[LLW04] Liu Y., Liu X., Wu E.: Real-time 3D fluid simulation on GPU with complex obstacles. In Proceedings of Pacific Graphics 2004 (Oct. 2004), pp. 247-256.

[LM01] Larsen E. S., McAllister D.: Fast matrix multiplies using graphics hardware. In Proceedings of the 2001 ACM/IEEE Conference on Supercomputing (New York, NY, USA, 2001), ACM Press, p. 55.

[LP84] Levinthal A., Porter T.: Chap - a SIMD graphics processor. In Computer Graphics (Proceedings of SIGGRAPH 84) (Minneapolis, Minnesota, July 1984), vol. 18, pp. 77-82.

[LRDG90] Lengyel J., Reichert M., Donald B. R., Greenberg D. P.: Real-time robot motion planning using rasterizing computer graphics hardware. In Computer Graphics (Proceedings of ACM SIGGRAPH 90) (Aug. 1990), vol. 24, pp. 327-335.

[LSK*05] Lefohn A., Sengupta S., Kniss J., Strzodka R., Owens J. D.: Dynamic adaptive shadow maps on graphics hardware. In ACM SIGGRAPH 2005 Conference Abstracts and Applications (Aug. 2005). To appear.

[LW02] Lefohn A. E., Whitaker R. T.: A GPU-Based, Three-Dimensional Level Set Solver with Curvature Flow. Tech. Rep. UUCS-02-017, University of Utah, 2002.

[LWK03] Li W., Wei X., Kaufman A.: Implementing lattice Boltzmann computation on graphics hardware. In The Visual Computer (2003), vol. 19, pp. 444-456.

[MA03] Moreland K., Angel E.: The FFT on a GPU. In Graphics Hardware 2003 (July 2003), pp. 112-119. http://www.cs.unm.edu/~kmorel/documents/fftgpu/.

[Man03] Manocha D.: Interactive geometric and scientific computations using graphics hardware. ACM SIGGRAPH Course Notes, 11 (2003).

[MGAK03] Mark W. R., Glanville R. S., Akeley K., Kilgard M. J.: Cg: A system for programming graphics hardware in a C-like language. ACM Transactions on Graphics 22, 3 (July 2003), 896-907.

[MIA*04] McCormick P. S., Inman J., Ahrens J. P., Hansen C., Roth G.: Scout: A hardware-accelerated system for quantitatively driven visualization and analysis. In IEEE Visualization 2004 (Oct. 2004), pp. 171-178.

[Mic05a] Microsoft high-level shading language. http://msdn.microsoft.com/library/default.asp?url=/library/en-us/directx9_c/directx/graphics/reference/hlslreference/hlslreference.asp, 2005.



[Mic05b] Microsoft shader debugger. http://msdn.microsoft.com/library/default.asp?url=/library/en-us/directx9_c/directx/graphics/Tools/ShaderDebugger.asp, 2005.

[MM02] Ma V. C. H., McCool M. D.: Low latency photon mapping using block hashing. In Graphics Hardware 2002 (Sept. 2002), pp. 89-98.

[MOK95] Myszkowski K., Okunev O. G., Kunii T. L.: Fast collision detection between complex solids using rasterizing graphics hardware. The Visual Computer 11, 9 (1995), 497-512.

[Mor02] Moravánszky A.: Dense matrix algebra on the GPU. In Direct3D ShaderX2, Engel W. F., (Ed.). Wordware Publishing, 2002.

[MTP*04] McCool M., Toit S. D., Popa T., Chan B., Moule K.: Shader algebra. ACM Transactions on Graphics 23, 3 (Aug. 2004), 787-795.

[NHP04] Nyland L., Harris M., Prins J.: N-body simulations on a GPU. In Proceedings of the ACM Workshop on General-Purpose Computation on Graphics Processors (Aug. 2004).

[Nij03] Nijasure M.: Interactive Global Illumination on the Graphics Processing Unit. Master's thesis, University of Central Florida, 2003.

[OL98] Olano M., Lastra A.: A shading language on graphics hardware: The PixelFlow shading system. In Proceedings of SIGGRAPH 98 (July 1998), Computer Graphics Proceedings, Annual Conference Series, pp. 159-168.

[Ope03] OpenGL Architecture Review Board: ARB fragment program. Revision 26. http://oss.sgi.com/projects/ogl-sample/registry/ARB/fragment_program.txt, 22 Aug. 2003.

[Ope04] OpenGL Architecture Review Board: ARB vertex program. Revision 45. http://oss.sgi.com/projects/ogl-sample/registry/ARB/vertex_program.txt, 27 Sept. 2004.

[Ope05] OpenVIDIA: GPU-accelerated computer vision library. http://openvidia.sourceforge.net/, 2005.

[OSW*03] OpenGL Architecture Review Board, Shreiner D., Woo M., Neider J., Davis T.: OpenGL Programming Guide: The Official Guide to Learning OpenGL. Addison-Wesley, 2003.

[Owe05] Owens J.: Streaming architectures and technology trends. In GPU Gems 2, Pharr M., (Ed.). Addison Wesley, Mar. 2005, ch. 29, pp. 457-470.

[PAB*05] Pham D., Asano S., Bolliger M., Day M. N., Hofstee H. P., Johns C., Kahle J., Kameyama A., Keaty J., Masubuchi Y., Riley M., Shippy D., Stasiak D., Wang M., Warnock J., Weitzel S., Wendel D., Yamazaki T., Yazawa K.: The design and implementation of a first-generation CELL processor. In Proceedings of the International Solid-State Circuits Conference (Feb. 2005), pp. 184-186.

[PBMH02] Purcell T. J., Buck I., Mark W. R., Hanrahan P.: Ray tracing on programmable graphics hardware. ACM Transactions on Graphics 21, 3 (July 2002), 703-712.

[PDC*03] Purcell T. J., Donner C., Cammarano M., Jensen H. W., Hanrahan P.: Photon mapping on programmable graphics hardware. In Graphics Hardware 2003 (July 2003), pp. 41-50.

[PH89] Potmesil M., Hoffert E. M.: The Pixel Machine: A parallel image computer. In Computer Graphics (Proceedings of SIGGRAPH 89) (July 1989), vol. 23, pp. 69-78.

[POAU00] Peercy M. S., Olano M., Airey J., Ungar P. J.: Interactive multi-pass programmable shading. In Proceedings of ACM SIGGRAPH 2000 (July 2000), Computer Graphics Proceedings, Annual Conference Series, pp. 425-432.

[PS03] Purcell T. J., Sen P.: Shadesmith fragment program debugger. http://graphics.stanford.edu/projects/shadesmith/, 2003.

[Pur04] Purcell T. J.: Ray Tracing on a Stream Processor. PhD thesis, Stanford University, Mar. 2004.






[RMS92] Rossignac J., Megahed A., Schneider B.-O.: Interactive inspection of solids: Cross-sections and interferences. In Computer Graphics (Proceedings of SIGGRAPH 92) (July 1992), vol. 26, pp. 353-360.

[RR86] Rossignac J. R., Requicha A. A. G.: Depth-buffering display techniques for constructive solid geometry. IEEE Computer Graphics & Applications 6, 9 (Sept. 1986), 29-39.

[RS01a] Rumpf M., Strzodka R.: Level set segmentation in graphics hardware. In Proceedings of the IEEE International Conference on Image Processing (ICIP '01) (2001), vol. 3, pp. 1103-1106.

[RS01b] Rumpf M., Strzodka R.: Nonlinear diffusion in graphics hardware. In Proceedings of EG/IEEE TCVG Symposium on Visualization (VisSym '01) (2001), Springer, pp. 75-84.

[RS01c] Rumpf M., Strzodka R.: Using graphics cards for quantized FEM computations. In Proceedings of VIIP 2001 (2001), pp. 193-202.

[RS05] Rumpf M., Strzodka R.: Graphics processor units: New prospects for parallel computing. In Numerical Solution of Partial Differential Equations on Parallel Computers. Springer, 2005. To appear.

[RSSF02] Reinhard E., Stark M., Shirley P., Ferwerda J.: Photographic tone reproduction for digital images. ACM Transactions on Graphics 21, 3 (July 2002), 267-276.

[RTB*92] Rhoades J., Turk G., Bell A., State A., Neumann U., Varshney A.: Real-time procedural textures. In 1992 Symposium on Interactive 3D Graphics (Mar. 1992), vol. 25, pp. 95-100.

[SAA03] Sun C., Agrawal D., Abbadi A. E.: Hardware acceleration for spatial selections and joins. In Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data (June 2003), pp. 455-466.

[SD02] Stamminger M., Drettakis G.: Perspective shadow maps. ACM Transactions on Graphics 21, 3 (July 2002), 557-562.

[SDR03] Strzodka R., Droske M., Rumpf M.: Fast image registration in DX9 graphics hardware. Journal of Medical Informatics and Technologies 6 (Nov. 2003), 43-49.

[SDR04] Strzodka R., Droske M., Rumpf M.: Image registration by a regularized gradient flow: A streaming implementation in DX9 graphics hardware. Computing 73, 4 (Nov. 2004), 373-389.

[Sen04] Sen P.: Silhouette maps for improved texture magnification. In Graphics Hardware 2004 (Aug. 2004), pp. 65-74.

[SF91] Shinya M., Forgue M. C.: Interference detection through rasterization. The Journal of Visualization and Computer Animation 2, 4 (1991), 131-134.

[SG04] Strzodka R., Garbe C.: Real-time motion estimation and visualization on graphics cards. In IEEE Visualization 2004 (Oct. 2004), pp. 545-552.

[SHN03] Sherbondy A., Houston M., Napel S.: Fast volume segmentation with simultaneous visualization using programmable graphics hardware. In IEEE Visualization 2003 (Oct. 2003), pp. 171-176.

[SKALP05] Szirmay-Kalos L., Aszódi B., Lazányi I., Premecz M.: Approximate ray-tracing on the GPU with distance imposters. Computer Graphics Forum 24, 3 (Sept. 2005). To appear.

[SKv*92] Segal M., Korobkin C., van Widenfelt R., Foran J., Haeberli P.: Fast shadows and lighting effects using texture mapping. In Computer Graphics (Proceedings of SIGGRAPH 92) (July 1992), vol. 26, pp. 249-252.

[SL05] Sumanaweera T., Liu D.: Medical image reconstruction with the FFT. In GPU Gems 2, Pharr M., (Ed.). Addison Wesley, Mar. 2005, ch. 48, pp. 765-784.

[SLJ98] Stewart N., Leach G., John S.: An improved Z-buffer CSG rendering algorithm. In 1998 SIGGRAPH / Eurographics Workshop on Graphics Hardware (Aug. 1998), pp. 25-30.

[SOM04] Sud A., Otaduy M. A., Manocha D.: DiFi: Fast 3D distance field computation using graphics hardware. Computer Graphics Forum 23, 3 (Sept. 2004), 557-566.

[SPG03] Sigg C., Peikert R., Gross M.: Signed distance transform using graphics hardware. In IEEE Visualization 2003 (Oct. 2003), pp. 83-90.

[ST04] Strzodka R., Telea A.: Generalized distance transforms and skeletons in graphics hardware. In Proceedings of EG/IEEE TCVG Symposium on Visualization (VisSym '04) (2004), pp. 221-230.

[STM04] Sander P., Tatarchuk N., Mitchell J. L.: Explicit Early-Z Culling for Efficient Fluid Flow Simulation and Rendering. Tech. rep., ATI Research, Aug. 2004. http://www.ati.com/developer/techreports/ATITechReport_EarlyZFlow.pdf.

[Str02] Strzodka R.: Virtual 16 bit precise operations on RGBA8 textures. In Proceedings of Vision, Modeling, and Visualization (2002), pp. 171-178.

[Str04] Strzodka R.: Hardware Efficient PDE Solvers in Quantized Image Processing. PhD thesis, University of Duisburg-Essen, 2004.

[THO02] Thompson C. J., Hahn S., Oskin M.: Using modern graphics architectures for general-purpose computing: A framework and analysis. In Proceedings of the 35th Annual ACM/IEEE International Symposium on Microarchitecture (2002), pp. 306-317.

[Tre05] Trebilco D.: GLIntercept. http://glintercept.nutty.org/, 2005.

[TS00] Trendall C., Stewart A. J.: General calculations using graphics hardware, with applications to interactive caustics. In Rendering Techniques 2000: 11th Eurographics Workshop on Rendering (June 2000), pp. 287-298.

[Ups90] Upstill S.: The RenderMan Companion: A Programmer's Guide to Realistic Computer Graphics. Addison-Wesley, 1990.

[Ven03] Venkatasubramanian S.: The graphics card as a stream computer. In SIGMOD-DIMACS Workshop on Management and Processing of Data Streams (2003).

[Ver67] Verlet L.: Computer "experiments" on classical fluids. I. Thermodynamical properties of Lennard-Jones molecules. Phys. Rev., 159 (July 1967), 98-103.

[VKG03] Viola I., Kanitsar A., Gröller M. E.: Hardware-based nonlinear filtering and segmentation using high-level shading languages. In IEEE Visualization 2003 (Oct. 2003), pp. 309-316.

[VSC01] Vassilev T., Spanlang B., Chrysanthou Y.: Fast cloth animation on walking avatars. Computer Graphics Forum 20, 3 (2001), 260-267.

[WHE01] Weiskopf D., Hopf M., Ertl T.: Hardware-accelerated visualization of time-varying 2D and 3D vector fields by texture advection via programmable per-pixel operations. In Proceedings of Vision, Modeling, and Visualization (2001), pp. 439-446.

[Whi80] Whitted T.: An improved illumination model for shaded display. Communications of the ACM 23, 6 (June 1980), 343-349.

[WK04] Woetzel J., Koch R.: Multi-camera real-time depth estimation with discontinuity handling on PC graphics hardware. In Proceedings of the 17th International Conference on Pattern Recognition (Aug. 2004), pp. 741-744.

[WSE04] Weiskopf D., Schafhitzel T., Ertl T.: GPU-based nonlinear ray tracing. Computer Graphics Forum 23, 3 (Sept. 2004), 625-633.

[WWHL05] Wang J., Wong T.-T., Heng P.-A., Leung C.-S.: Discrete wavelet transform on GPU. http://www.cse.cuhk.edu.hk/~ttwong/software/dwtgpu/dwtgpu.html, 2005.

[XM05] Xu F., Mueller K.: Accelerating popular tomographic reconstruction algorithms on commodity PC graphics hardware. IEEE Transactions on Nuclear Science (2005). To appear.

[YLPM05] Yoon S.-E., Lindstrom P., Pascucci V., Manocha D.: Cache-oblivious mesh layouts. ACM Transactions on Graphics 24, 3 (Aug. 2005). To appear.

[YP05] Yang R., Pollefeys M.: A versatile stereo implementation on commodity graphics hardware. Real-Time Imaging 11, 1 (Feb. 2005), 7-18.

[YW03] Yang R., Welch G.: Fast image segmentation and smoothing using commodity graphics hardware. journal of graphics tools 7, 4 (2003), 91-100.

[Zel05] Zeller C.: Cloth simulation on the GPU. In ACM SIGGRAPH 2005 Conference Abstracts and Applications (Aug. 2005). To appear.






